Sunday, April 9, 2017

The High Cost of AnyVal subclasses...

The claim of a multi-paradigm language is to harmoniously serve various approaches to programming. The AnyVal subclass feature forms a strong counterargument to Scala’s multiparadigm claim.

AnyVal subclasses penalize parametric-polymorphic, type-safe programming, in order to better support type-unsafe programming styles, such as those making use of isInstanceOf. They sneakily shift the blame for their performance problems onto type safety and polymorphism. I will provide an existence proof that the blame ought to land squarely on AnyVal subclasses, but I cannot stop this blame-shifting from lending further credence to the witticism “If scala was the only language I had to think in, I’d think functional programming was a bad idea that didn’t scale, too.”

Moreover, by creating the false impression that the “newtype problem” has been solved in Scala, AnyVal subclasses obscure solutions that better serve polymorphic, type-safe programming. While I describe such a solution in this article, I have no illusions that I alone can reverse the upward trend of the AnyVal meme.

Scala, today, has the potential to better support type-safe programming, and it has since before the advent of AnyVal subclasses. In this article, we will focus on how the language could reveal this potential, becoming a better foundation for polymorphic, type-safe programming than it advertises today.

A String reference must be boxed

Suppose that you want a “wrapper” around Strings with a unique type so that they can’t be accidentally confused with arbitrary Strings. This is a common use case for a newtype, a wrapper with intentionally incompatible type that exists only at compile time. (The name “newtype” comes from the Haskell keyword for its version of this feature.)

You decide to use extends AnyVal, since you have heard that this is a compile-time-only class that doesn’t get allocated on the heap.

class Label(val str: String) extends AnyVal

object Label {
  def apply(s: String): Label =
    new Label(s)

This seems to do the trick with your first several tests.

class MyFirstTests {
  def combineLabels(l: Label, r: Label): Label =
    Label(l.str + r.str)

  def printLabels(): Unit = {
    val fst = Label("hello")
    val snd = Label("world")

As reported by javap, the new Label goes away for Label.apply.

// javap -c -cp target/scala-2.12/classes hcavsc.av.Label$

  public java.lang.String apply(java.lang.String);
       0: aload_1
       1: areturn

It vanishes for the signature of combineLabels too, meaning that we can write some functions over Labels without allocating them.

// javap -cp target/scala-2.12/classes hcavsc.av.MyFirstTests

  public java.lang.String combineLabels(java.lang.String, java.lang.String);

You can even use Label in a case class, and it will be String at runtime.

case class Labelled[A](lbl: Label, a: A)

// javap -p -cp target/scala-2.12/classes hcavsc.av.Labelled

  private final java.lang.String lbl;
  private final A a;

But then, you decide that you want a List of Labels.

// add to printLabels
val lbls = List(fst, snd)

// javap -c -cp target/scala-2.12/classes hcavsc.av.MyFirstTests

      24: iconst_2
      25: anewarray     #56                 // class hcavsc/av/Label
      28: dup
      29: iconst_0
      30: new           #56                 // class hcavsc/av/Label
      33: dup
      34: aload_1
      35: invokespecial #59                 // Method hcavsc/av/Label."<init>":(Ljava/lang/String;)V
      38: aastore
      39: dup
      40: iconst_1
      41: new           #56                 // class hcavsc/av/Label
      44: dup
      45: aload_2
      46: invokespecial #59                 // Method hcavsc/av/Label."<init>":(Ljava/lang/String;)V
      49: aastore
      50: invokevirtual #63                 // Method scala/Predef$.genericWrapArray:(Ljava/lang/Object;)Lscala/collection/mutable/WrappedArray;
      53: invokevirtual #66                 // Method scala/collection/immutable/List$.apply:(Lscala/collection/Seq;)Lscala/collection/immutable/List;

Huh. Didn’t expect those two news to be there. Ah well, maybe now that they’re in the list,{x => Label(x.str + "Aux")}

// javap -c -cp target/scala-2.12/classes hcavsc.av.MyFirstTests

  public static final java.lang.Object $anonfun$printLabels$1$adapted(java.lang.Object);
       0: new           #61                 // class hcavsc/av/Label
       3: dup
       4: aload_0
       5: checkcast     #61                 // class hcavsc/av/Label
       8: invokevirtual #117                // Method hcavsc/av/Label.str:()Ljava/lang/String;
      11: invokestatic  #119                // Method $anonfun$printLabels$1:(Ljava/lang/String;)Ljava/lang/String;
      14: invokespecial #64                 // Method hcavsc/av/Label."<init>":(Ljava/lang/String;)V
      17: areturn

OK, sure, so you took it out and put it back, so it unboxed and then boxed again. How about a tuple, instead?

// add to printLabels
(fst, snd)

// javap -c -cp target/scala-2.12/classes hcavsc.av.MyFirstTests

      73: new           #103                // class scala/Tuple2
      76: dup
      77: new           #61                 // class hcavsc/av/Label
      80: dup
      81: aload_1
      82: invokespecial #64                 // Method hcavsc/av/Label."<init>":(Ljava/lang/String;)V
      85: new           #61                 // class hcavsc/av/Label
      88: dup
      89: aload_2
      90: invokespecial #64                 // Method hcavsc/av/Label."<init>":(Ljava/lang/String;)V
      93: invokespecial #106                // Method scala/Tuple2."<init>":(Ljava/lang/Object;Ljava/lang/Object;)Vf

Two more news. Fine. How about the identity method?

// add to printLabels

// javap -c -cp target/scala-2.12/classes hcavsc.av.MyFirstTests

      97: getstatic     #59                 // Field scala/Predef$.MODULE$:Lscala/Predef$;
     100: new           #61                 // class hcavsc/av/Label
     103: dup
     104: aload_1
     105: invokespecial #64                 // Method hcavsc/av/Label."<init>":(Ljava/lang/String;)V
     108: invokevirtual #109                // Method scala/Predef$.identity:(Ljava/lang/Object;)Ljava/lang/Object;

So there seems to be an impressive collection of things that will cause an AnyVal subclass to box. You assume there’s a good reason they implemented it this way; we’ll get into that later.

No boxing with type tags

However, you decide to look for an alternative newtype mechanism that doesn’t box, under the theory that scalac’s reasons for boxing AnyVal subclasses don’t apply to the use cases you have in mind for Label and similar things in your codebase.

You have heard that Scalaz’s “type tags” are a kind of newtype with no boxing. You could just pull in scalaz-core and see if you can get them to work, but decide to implement Label directly using the same technique as Scalaz tags, instead.

object Labels {
  sealed abstract class LabelImpl {
    type T
    def apply(s: String): T
    def unwrap(lbl: T): String

  // do not forget `: LabelImpl`; it is key
  val Label: LabelImpl = new LabelImpl {
    type T = String
    override def apply(s: String) = s
    override def unwrap(lbl: T) = lbl

  type Label = Label.T

import Labels._

While regretting that the compiler no longer makes your Label type very convenient to define, you press on. First, to confirm, you can’t treat an arbitrary String as a Label:

scala> "hi there": Label
<console>:15: error: type mismatch;
 found   : String("hi there")
 required: hcavsc.subst.Labels.Label
    (which expands to)  hcavsc.subst.Labels.Label.T
       "hi there": Label

So far, so good. Then, why not retry some of the earlier experiments that caused the AnyVal-based label to box?

// javap -c -cp target/scala-2.12/classes hcavsc.subst.MyFirstTests

val fst = Label("hello")
val snd = Label("world")
      24: getstatic     #43                 // Field scala/Predef$.MODULE$:Lscala/Predef$;
      27: aload_1
      28: invokevirtual #47                 // Method scala/Predef$.identity:(Ljava/lang/Object;)Ljava/lang/Object;

(fst, snd)
      32: new           #49                 // class scala/Tuple2
      35: dup
      36: aload_1
      37: aload_2
      38: invokespecial #53                 // Method scala/Tuple2."<init>":(Ljava/lang/Object;Ljava/lang/Object;)V

val lbls = List(fst, snd)
      48: iconst_2
      49: anewarray     #4                  // class java/lang/Object
      52: dup
      53: iconst_0
      54: aload_1
      55: aastore
      56: dup
      57: iconst_1
      58: aload_2
      59: aastore
      60: invokevirtual #62                 // Method scala/Predef$.genericWrapArray:(Ljava/lang/Object;)Lscala/collection/mutable/WrappedArray;
      63: invokevirtual #65                 // Method scala/collection/immutable/List$.apply:(Lscala/collection/Seq;)Lscala/collection/immutable/List;{x => Label(Label.unwrap(x) + "Aux")}
  public static final java.lang.Object $anonfun$printLabels$1(java.lang.Object);
       0: getstatic     #26                 // Field hcavsc/subst/Labels$.MODULE$:Lhcavsc/subst/Labels$;
       3: invokevirtual #30                 // Method hcavsc/subst/Labels$.Label:()Lhcavsc/subst/Labels$LabelImpl;
       6: new           #104                // class java/lang/StringBuilder
       9: dup
      10: invokespecial #106                // Method java/lang/StringBuilder."<init>":()V
      13: getstatic     #26                 // Field hcavsc/subst/Labels$.MODULE$:Lhcavsc/subst/Labels$;
      16: invokevirtual #30                 // Method hcavsc/subst/Labels$.Label:()Lhcavsc/subst/Labels$LabelImpl;
      19: aload_0
      20: invokevirtual #110                // Method hcavsc/subst/Labels$LabelImpl.unwrap:(Ljava/lang/Object;)Ljava/lang/String;
      23: invokevirtual #114                // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      26: ldc           #116                // String Aux
      28: invokevirtual #114                // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      31: invokevirtual #120                // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
      34: invokevirtual #36                 // Method hcavsc/subst/Labels$LabelImpl.apply:(Ljava/lang/String;)Ljava/lang/Object;
      37: areturn

No allocation there. Hmm. Well, maybe our concrete LabelImpl instance is doing some secret boxing?

// javap -c -cp target/scala-2.12/classes 'hcavsc.subst.Labels$$anon$1'

  public java.lang.String apply(java.lang.String);
       0: aload_1
       1: areturn

  public java.lang.Object apply(java.lang.String);
       0: aload_0
       1: aload_1
       2: invokevirtual #27                 // Method apply:(Ljava/lang/String;)Ljava/lang/String;
       5: areturn

  public java.lang.String unwrap(java.lang.String);
       0: aload_1
       1: areturn

  public java.lang.String unwrap(java.lang.Object);
       0: aload_0
       1: aload_1
       2: checkcast     #21                 // class java/lang/String
       5: invokevirtual #23                 // Method unwrap:(Ljava/lang/String;)Ljava/lang/String;
       8: areturn

No boxing there. That makes sense; in that context, Label is String; the fact that our Label-using code doesn’t know that is irrelevant, because we hid that information using existential types.

So, it is possible to have a newtype mechanism that doesn’t box. You don’t have to wait for the JVM to deliver its own brand of value types; you can even implement it yourself, in Scala, today. They must have had another reason for all this boxing, because “we have to because JVM” is denied by the behavior of Scala-JVM itself.

You aren’t sure what those reasons are, but you decide to port the rest of your code to use the existential Label. Befitting an unboxed newtype, the runtime representation of List[Label] is exactly the same as the underlying List[String], as well as every Option, Either, and whatever else you can think up.

You notice that the erasure for Label is different, but this seems significantly less serious than the boxing problem, so leave it for now. (We will dig into related design decisions later.)

What can you do with a box? What can you do without a box?

Let’s start with a quick comparison of boxing AnyVal and the “type tagging” mechanism we’ve just seen.

Capability AnyVal subclass Type tag
Defining methods normal override; virtual method dispatch available implicit class enrichment only
lbl.getClass Label String
Cast Any to Label checked at runtime unchecked; no wrapper left at runtime
isInstanceOf checked at runtime unchecked; same recent casting doesn’t work
Adding type parameters to methods boxing/unbox penalty no boxing penalty
Wrapping a List O(n): box every element and reallocate list itself O(1), with subst: no allocation, output list eq to input list
Unwrapping a list O(n): reallocate list, unbox each element O(1): eq output with subst. Also possible to make unwrapping a <: (free liftable automatic upcast)
Coinductive type class instances works; boxing penalty applies works; no boxing penalty
Wrapping whole program parts each function must be wrapped to add per-value wrapping/unwrapping O(1): just works with subst

I detect from this matrix a particular theme: AnyVal subclasses give up a lot of capability in the type-safe arena. Consider rewriting a loop that uses Label as state as a foldLeft: you must contend with a new boxing/unboxing penalty, since the state parameter in a foldLeft is type-parametric. It’s more fodder for the persistent higher-order function skeptics among us.

While we know that adding type parameters to our functions improves type-safety, the skeptic will note the boxing penalty, and attribute it to parametric polymorphism. But we know the true culprit.

If AnyVal subclassing taxes type-safe programming in these ways, what is it spending the money on? Simple: support for isInstanceOf, “safe” casting, implementing interfaces, overriding AnyRef methods like toString, and the like.

As type-safe, parametrically-polymorphic programmers, we avoid these features, as a matter of principle and of practice. Some, like checked casting, are simply not type-safe. Some ruin free theorems, like toString, and we would prefer safe mechanisms, like the Show typeclass, to actually tell us at compile time if our programs make sense. Yet, if we use AnyVal subclasses, we have to pay the price for all the programmers that wish to write type-unsafe code, like List[Any] => List[Label]. All is not well in Multiparadigmatic Land.

When will our methods be resolved?

To showcase the relationship of the two approaches to runtime-reflective programming versus statically-proven programming, let’s consider stringification.

Scala provides the toString virtual method on Any. Calling this method is dynamically resolved on the value itself; it is as if every value must carry around a pointer to a function that, given itself, returns a String. We can define this for our original AnyVal-based Label, and so toString on List et al will also work.

// add to class Label
  override def toString = s"Label($str)"

scala> List(fst, snd).toString
res0: String = List(Label(hello), Label(world))

scala> Some(fst).toString
res1: String = Some(Label(hello))

Moreover, this “works” even for the type List[Any].

scala> List[Any](fst, "hi").toString
res2: String = List(Label(hello), hi)

You cannot override toString for our fully-erased Label. After all, every Label is just a String at runtime! (Different types, same class.)

However, the type-safe programmer will recognize List[Any] as a type that, if it occurs in her program, means “something has gone wrong with this program”. Moreover, because toString doesn’t make sense for all types, we use a static mechanism, like the scalaz.Show typeclass. And this works fine for Label, because it is statically resolved by type, not dependent on an implicit runtime member of every Label; in fact, it can only work because it is static!

// add to object Labels
  import scalaz.Show

  implicit val showLabel: Show[Label] =
    Show shows {lbl =>

scala> import, scalaz.std.list._,

scala> List(fst, snd).shows
res1: String = [Label(hello),Label(world)]

scala> some(fst).shows
res2: String = Some(Label(hello))

So if you are doing this kind of programming, it doesn’t matter whether you can’t override toString, or type test, &c; you weren’t doing it anyway. But, aside from a little performance bump, what do you gain from unboxed type-tagging?

When is a Label a String? When is it not?

You notice that subst is at the foundation of several Scalaz constructs like Leibniz and Liskov, and plays a prominent role in the Tag API as well. You decide to add this to your LabelImpl as well.

// in LabelImpl
def subst[F[_]](fs: F[String]): F[T]

// and in val Label
override def subst[F[_]](fs: F[String]) = fs

It’s interesting that you can use this to tag a whole List[String] in constant time:

scala> val taggedList = Label.subst(List("hello", "world"))
taggedList: List[Label.T] = List(hello, world)

It’s also interesting that you can use this to untag a whole list in constant time.

scala> Label.subst[Lambda[x => List[x] => List[String]]](identity)(taggedList)
res0: List[String] = List(hello, world)

Functions and typeclass instance can be tagged or untagged, too.

scala> Label.subst[Lambda[x => (x, Int) => x]](_ substring _)
res1: (Label.T, Int) => Label.T = $$Lambda$3194/964109489@72557d64

scala> import scalaz.Monoid, scalaz.std.string._

scala> Label.subst(Monoid[String])
res3: scalaz.Monoid[Label.T] = scalaz.std.StringInstances$stringInstance$@252798fe

All of this works because subst is really evidence that, deep down, String and Label are the same.

scala> import scalaz.Leibniz, Leibniz.{===, refl}

scala> Label.subst[String === ?](refl)
res4: Leibniz[Nothing,Any,String,Label.T] = scalaz.Leibniz$$anon$2@702af12c

Yet, you ran an experiment earlier to prove that you can’t confuse String and Label; indeed, this still holds true, despite the presence of subst!

scala> "still a string": Label
<console>:21: error: type mismatch;
 found   : String("still a string")
 required: hcavsc.subst.Labels.Label
    (which expands to)  hcavsc.subst.Labels.Label.T
       "still a string": Label

scala> Label("still a label"): String
<console>:21: error: type mismatch;
 found   : hcavsc.subst.Labels.Label.T
 required: String
       Label("still a label"): String

Here’s what’s happening: in a sense, (new Label(_)): (String => Label) and (_.str): (Label => String) witness that there’s a conversion between the two types. subst witnesses that there’s identical runtime representation between its own two types. You get to selectively reveal this evidence when it makes writing your program more convenient; the rest of the time, it is hidden.

But I would like to step one level up: this is a design space, and subst as we have seen it isn’t appropriate for all designs. As the author of your own abstract newtypes, you get to choose how much, if any, of this underlying type equality to reveal.

If subst is the right choice

For various reasons, the above is how Scalaz Tag (@@) is defined. If you wish these semantics, you might as well throw everything else away and write

sealed trait LabelTag // no instances
type Label = String @@ LabelTag
val Label = Tag.of[LabelTag]

and take advantage of the convenient tools around subst defined in Tag.Of. But it’s not the only choice! It’s one point in the design space. To do right by your API users, it’s worth exploring that design space a little more.

Type-unsafe code isn’t type-safe

Unboxed existential tagging spreads through your codebase. You feel free to apply it liberally, because you know you aren’t paying the wrapping costs of AnyVal subclasses; all these new abstraction layers are pure type-level, and fully erased.

You receive a “bug report” from a fellow developer that this expression never seems to filter out the non-label Strings.

(xs: List[Any]).collect{case t: Label => t}
<console>:16: warning: abstract type pattern
 (the underlying of hcavsc.translucent.Labels.Label)
 is unchecked since it is eliminated by erasure
       (xs: List[Any]).collect{case t: Label => t}
<console>:16: warning: The outer reference
 in this type test cannot be checked at run time.
       (xs: List[Any]).collect{case t: Label => t}

Your mind on safe pattern matching practice, you add def unapply(s: String): Option[T] to LabelImpl and counsel preference for the form case Label(t) => ..., as well as to not ignore -unchecked warnings.

You get another bug report that this always seems to succeed.

(s: String).asInstanceOf[Label]

Repeating your advice about warnings, you start to wonder, “where is this kind of code coming from?”

Someone else complains that they want to make T extends Ordered[T], and can’t fathom where the code should go. You advise the static approach of implementing the Ordering typeclass instance instead for T, wonder how deep the object-orientation hole goes, and forward the link about the typeclass pattern again, too.

Suppose you went back to AnyVal

We’ve seen that AnyVal subclasses could have been incredibly cheap, but aren’t, so as to support “features” like checked casting. Who’s going to foot the bill?

  1. Oh, this allocates when passing through polymorphic contexts, but not monomorphic ones? Avoid polymorphic code.
  2. Oh, this extra type-safety adds all this allocation? Type safety is expensive at runtime; we need to stick to String.
  3. We can’t do any better; the JVM limits the possibilities. You have to pay for runtime wrapping if you want a wrapper type.

In this article, I have demonstrated that none of these conclusions are correct. However, only a tiny minority of Scala practitioners will ever read this article, and I will not blame the rest for drawing these seemingly straightforward inferences, ultimately faulty as they are.

The real cost of AnyVal subclasses is not all the needless memory allocation. The real cost is the damage to the practice of type-safe programming in Scala. It’s in all the curious developers who sought to add a little more type safety to their programs, only to find themselves penalized by the runtime, once bitten. It’s in the reinforcement of this attitude towards abstraction that they’ll continue to carry with them, the next time an opportunity presents itself. It’s a missed opportunity for pure type-level thinking, all so that asInstanceOf “works”.

See “…and the glorious subst to come” for further development of the ideas in this article.

This article was tested with Scala 2.12.1, Scalaz 7.2.10, and Kind Projector 0.9.3. The code is available in compilable form for your own experiments via Bazaar.


Vitor said...

Great post! I released a small project months ago that generates valueclasses and newtypes in compile time according to any given type. I'd love your feedback on that.

Unknown said...

This is a thoughtful post. I too shuttered when I saw the introduction of AnyVal.

However, do we place too much blame on Scala? Scala is trying to work around the limitations of the JVM; in my understanding, there is not a cheap way to do value classes in the JVM.

It does seem like Scala could be more consistent... IE - always erase the value class, rather than have special rules in different circumstances. And, throw a fat compiler warning when matching on an AnyVal value class? Would this help?

Kevin Meredith said...

"The real cost is the damage to the practice of type-safe programming in Scala."

Is this "cost" hidden to the developer using Value Classes? If not, i.e. it's a direct downside, could you please demonstrate the loss of type safety with AnyVal?

Stephen Compall said...

As part of this post, I demonstrate an alternative implementation as you describe, "always erase the value class", and moreover, that this has nothing to do with JVM limitations; there is no need to wait for a JVM "value class" mechanism such as is in the pipeline.

Stephen Compall said...

The loss is in that people are discouraged from using type-safe mechanisms. Folks are more likely to do "stringly-typed" programming if it appears that type-safe alternatives have significant runtime costs.

Lachlan O'Dea said...

There is a downside to tagged types though: they will *always* box value types. AnyVal can avoid boxing of value types in many situations. Unfortunately there does not seem to be a way to get the best of both worlds.

Stephen Compall said...

See section "Boxing Ints" of the followup article, . In short, you can get a tagged Int to erase to int, &c.

The need for an upper bound is not ideal, as discussed in section "T = String translucency" of the same article.

Nicolas Rouquette said...

Thanks for a very informative explanation about the pros/cons of AnyVal vs. type tags and of some of the pitfalls involved. Have you considered exploring applying type tags to Spark?

Stephen Compall said...

I haven't considered any specific library, since they are a very general-purpose tool, particularly in codebases that prefer type-safety.

I think it's an interesting exercise to pick out one informality (something represented by String or Int or some such that really deserves its own type), refine it with a type tag to formalize the abstraction, and see what happens when you try to get it to compile. Because "be careful" is just a joke strategy, you'll likely find several bugs where unverified strings/ints/&c "leaked" into the informal stringly "type". Figuring out what to do then can be a real challenge, I think.

So if you're a fan of Spark, you're far better placed to experiment with type tags therein than I am. Let me know how it goes :)

Nicolas Rouquette said...

Ok; I will keep you posted.

Justin said...

Excellent article; I hope you can answer a follow up. Unfortunately `toString` is not going away. I'd like to use type tags to protect sensitive information (most importantly, making it hard to log that info accidently). That means overriding `toString`. Is there any way type tags can help or am I stuck with `AnyVal`? (I don't have much hope but ...)


Stephen Compall said...

Not as long as you accept the premise that you must use `toString`. Its presence doesn't imply a requirement to use it; Scalaz's `Show` typeclass is one example of an alternative, type-safe approach, but is not any more a one-size-fits-all than toString is.

As long as you recognize that dynamic, unchecked stringification isn't some law of physics, but a deliberate choice that cuts you off from other possibilities (e.g. a logging system may certainly reject printing values that don't meet some particular typeclass constraint, and that includes wrappers you may write for unsafe logging systems), that's half the battle.

Stephen Compall said...

I didn't recall this while writing the article, but there's a good chance I internalized the basic idea of how to do this from this scala-user post by Julian Michael (@julianmichael on github).

So, to be safe, assume I invented none of this, only poked around with the bytecode/JS output for a while, and got all the real ideas from OCaml and/or Julian's post.

About Me

My photo

I am S11001001, s11 for short.  Programmer and Free Software enthusiast.

Search for my name to see more stuff about me; no one shares my real name, and no one shares my username, though I can't understand why.