[Notes] Java generic erasure: a pragmatic compromise

vlambda
2020-07-21


The original article (erasure / Brian Goetz / 2020.06) explains why choosing type erasure when generics were added to Java in 2004 was a sensible, pragmatic decision.

Erasure is not an evil or ugly hack

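Erasure is essentially a tool for translating code from one level (with richer types) to a lower level (with less rich types): for example, from a high-level language to an intermediate representation (IR), then to platform-specific native code, and finally to hardware instructions. The lower layers naturally offer simpler, weaker type abstractions. The article gives two examples: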

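Java bytecode provides dedicated instructions for moving integers between the operand stack and the local variable set (iload, istore), with analogous instructions for floats (fload, fstore), but none for byte, short, char, or boolean: the compiler erases these to int, so they use the same instructions as int. (In the same spirit, nobody would want to bake the semantics of virtual dispatch into the x86 instruction set, or model Java's primitive type set in its registers.)
When C is compiled to native code, signed and unsigned integers are both erased into general-purpose registers, because there are no separate "signed registers" or "unsigned registers".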
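The first example can be observed, indirectly, from Java source. The sketch below is not from the article; the class and method names are made up for illustration. It relies on the language rule that byte operands are promoted to int before arithmetic, which lines up with the bytecode having only int arithmetic for these types.

```java
// Illustrative sketch (hypothetical names): sub-int primitives are computed as int.
class SubIntArithmetic {
    /** Adds two bytes; both operands are promoted to int before the addition. */
    static int sum(byte a, byte b) {
        return a + b; // javac emits the int instruction iadd here
    }

    /** Narrows the int result back to byte with an explicit cast. */
    static byte narrowedSum(byte a, byte b) {
        return (byte) (a + b); // the cast compiles to i2b after the iadd
    }

    public static void main(String[] args) {
        byte a = 100, b = 100;
        System.out.println(sum(a, b));         // 200
        System.out.println(narrowedSum(a, b)); // -56
    }
}
```

Running javap -c on the compiled class is one way to confirm which instructions the compiler actually chose.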

Homogeneous and heterogeneous translation

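Note: this part requires a fairly deep understanding of compiler theory, and I only follow some of it. If you need the details, read the original article and the paper on translation strategies it cites: Two Ways to Bake Your Pizza - Translating Parameterised Types into Java.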

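In short, the key difference between homogeneous and heterogeneous translation is whether each instantiation of a generic type is treated as a separate entity and compiled to its own artifact.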

The article uses Java (homogeneous translation) and C++ (heterogeneous translation) as examples:

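In Java, Foo<T>, Foo<String>, and Foo<Integer> are all translated to the same artifact, Foo.class.
In C++, different instantiations of a template are entirely different types with different semantics and generated code; vector<int> and vector<float> are two artifacts. Under heterogeneous translation, type safety improves markedly, and in some cases a particular instantiation can be optimized specially (they are separate artifacts, after all), but the extra artifacts take up more space. (The article mentions an extreme case of this footprint cost: Scala's @specialized annotation makes the compiler generate specialized artifacts for primitive types, so a few lines of code can easily produce a 100 MB JAR file.)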
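The Java half of this comparison is easy to check at runtime. The following is a minimal sketch with hypothetical class and method names, using ArrayList in place of the article's Foo: two different parameterizations report the same runtime class.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (hypothetical names): one class file backs every parameterization.
class HomogeneousTranslationDemo {
    /** True if List<String> and List<Integer> instances share one runtime class. */
    static boolean sharesOneClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> ints = new ArrayList<>();
        return strings.getClass() == ints.getClass();
    }

    /** Name of that single runtime class; it carries no type argument. */
    static String runtimeClassName() {
        return new ArrayList<String>().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(sharesOneClass());   // true
        System.out.println(runtimeClassName()); // java.util.ArrayList
    }
}
```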

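Scala is itself a language that runs on the JVM, and it illustrates the earlier point that erasure is a tool for translating code from one level to a lower one: with the @specialized annotation, the Scala compiler applies a heterogeneous translation strategy at the source-code level instead of Java's homogeneous one. So it is not true that every JVM language uses homogeneous translation.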

Why Java chose generic erasure

The article summarizes the scenarios that need reified type parameter information:

Reflection: finding out which type T stands for in a given List<T>;
Layout or API specialization (I don't fully understand this point): In a language with primitive types or inline classes, it might be nice to flatten the layout of a Pair<int, int> to hold two ints, rather than two references to boxed objects.

Runtime type checking:

    List<String> list = new ArrayList<>();
    List genericList = list;
    genericList.add(1);

The code above causes heap pollution, so ideally the problem would be caught before the pollution happens.
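To make the snippet runnable, here is a sketch that wraps it in a class (the class and method names are hypothetical). The second method is an addition that is not in the article: it uses the standard library's Collections.checkedList to approximate what a runtime check on insertion would look like.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch (hypothetical names) of heap pollution through a raw reference.
class RawListPollution {
    /** Pollutes a List<String> via a raw alias; the failure only shows up on read. */
    @SuppressWarnings({"rawtypes", "unchecked"})
    static String pollute() {
        List<String> list = new ArrayList<>();
        List genericList = list; // raw type: javac warns, the JVM sees a plain List
        genericList.add(1);      // succeeds at runtime; the heap is now polluted
        try {
            String s = list.get(0); // compiler-inserted cast to String fails here
            return "no failure: " + s;
        } catch (ClassCastException e) {
            return "failed on read";
        }
    }

    /** Same sequence on a checked view, which rejects the element on insertion. */
    @SuppressWarnings({"rawtypes", "unchecked"})
    static String polluteChecked() {
        List<String> list = Collections.checkedList(new ArrayList<>(), String.class);
        List genericList = list;
        try {
            genericList.add(1); // the checked view verifies the element type now
            return "no failure";
        } catch (ClassCastException e) {
            return "failed on write";
        }
    }

    public static void main(String[] args) {
        System.out.println(pollute());        // failed on read
        System.out.println(polluteChecked()); // failed on write
    }
}
```

The checked view pays for its early failure with a type test on every insertion, which is a small-scale version of the runtime cost discussed later in these notes.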

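Generic erasure is called a compromise because the decision had to weigh Java's goals at the time, the priorities and constraints among the three goals behind the scenarios above (programmer convenience, performance, and safety), and the available alternatives.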

Java's goal: gradual migration compatibility

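JDK 1.0 was released in 1996, and generics arrived with JDK 5 in 2004. The code written during those eight years could not be expected to make way for generics, so the goal for Java generics was: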

It must be possible to evolve an existing non-generic class to be generic in a binary-compatible and source-compatible manner.

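This means every use of ArrayList in old code is free to be generified immediately, later, or never. It avoids wiping out the investment in old code on a flag day, when everyone would have to either recompile all the world's code or stay pinned to JDK 1.4.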
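As a rough illustration of this compatibility goal, the sketch below has generified client code call a method that still uses the raw List type. The method legacyCount is a hypothetical stand-in for pre-generics library code; it is not from the article.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (hypothetical names): old raw-typed code and new generic code coexist.
class GradualMigrationDemo {
    /** Stand-in for pre-generics code that was never updated: counts non-null items. */
    @SuppressWarnings("rawtypes")
    static int legacyCount(List items) {
        int n = 0;
        for (int i = 0; i < items.size(); i++) {
            if (items.get(i) != null) {
                n++;
            }
        }
        return n;
    }

    /** Generified client code passing a List<String> to the legacy method unchanged. */
    static int countNames() {
        List<String> names = new ArrayList<>();
        names.add("Ada");
        names.add(null);
        names.add("Grace");
        return legacyCount(names); // List<String> is accepted where raw List is expected
    }

    public static void main(String[] args) {
        System.out.println(countNames()); // 2
    }
}
```

Because both sides compile to the same erased List, neither has to be recompiled or rewritten when the other migrates.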

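The article also mentions a basic design property of Java: it is separately compiled and dynamically linked. This is regarded as one of Java's greatest strengths: A can be compiled against version 1 of B and then run against version 2 of B without being recompiled. My note: this is both the foundation of hot reloading and the reason a dependency version conflict is sometimes only discovered after the code ships 😂.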

Heap pollution

The article gives an example:

    Box<String> bs = new Box<>("hi!");   // safe
    Box<?> bq = bs;                      // safe, via subtyping
    Box<Integer> bi = (Box<Integer>) bq; // unchecked cast -- warning issued
    Integer i = bi.get();                // ClassCastException in synthetic cast to Integer

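Under homogeneous translation, the error only surfaces when the fourth statement executes, even though the heap was already polluted by the unchecked cast in the third.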
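The article does not show the definition of Box, so the sketch below assumes a minimal one (a single immutable field with a getter) to make the four statements runnable. All names other than Box and get are hypothetical.

```java
// Illustrative sketch: the article's four statements with an assumed Box definition.
class BoxHeapPollution {
    /** Hypothetical minimal Box; the article's real definition is not shown. */
    static final class Box<T> {
        private final T value;

        Box(T value) {
            this.value = value;
        }

        T get() {
            return value;
        }
    }

    /** Runs the four statements and reports where the failure appears. */
    @SuppressWarnings("unchecked")
    static String run() {
        Box<String> bs = new Box<>("hi!");   // safe
        Box<?> bq = bs;                      // safe, via subtyping
        Box<Integer> bi = (Box<Integer>) bq; // unchecked cast: nothing is checked at runtime
        try {
            Integer i = bi.get();            // compiler-inserted cast to Integer fails
            return "no failure: " + i;
        } catch (ClassCastException e) {
            return "ClassCastException at statement 4";
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // ClassCastException at statement 4
    }
}
```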

A modern IDE will generally flag the third statement with a warning.

The article gives a simple rule for ruling out heap pollution:

If a program compiles with no unchecked or raw warnings, the synthetic casts inserted by the compiler will never fail.

My note: this is also a reminder not to casually disable or ignore IDE warnings.

JVM implementations and the language ecosystem

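On Oracle's website, the Java language and the JVM are covered by two separate specifications. The Java compiler compiles source code into class files the JVM can execute, but the JVM will happily run any class file that conforms to the specification, whether the source language is a Java relative such as Scala or Kotlin, or a port of another language (JRuby, Jython, Jaskell).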

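The article's figures: by some measures, more than 200 languages use the JVM as a compilation target, and there are a dozen or so commercial JVM distributions.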

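So back in 2004, although it was technically possible to add generics support directly to the JVM, doing so would have been a major engineering investment and would also have required coordination among the derived languages and the JVM implementers. And as noted earlier, Scala (via @specialized) uses heterogeneous translation, unlike Java.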

Erasure is a pragmatic compromise

To sum up, technical and ecosystem constraints pushed Java toward the homogeneous translation strategy:

Runtime costs. A heterogeneous translation entails all sorts of runtime costs: greater static and dynamic footprint, greater class-loading costs, greater JIT costs and code cache pressure, etc. This might have put developers in a position where they had to choose between type-safety and performance.

Migration compatibility. There was no known translation scheme at the time that would have allowed a migration to reified generics to be source- and binary-compatible, creating flag days and invalidating developer’s considerable investment in their existing code.

Runtime costs, bonus edition. If reification is interpreted as checking types at runtime (just as stores into Java’s covariant arrays are dynamically checked), this would have a significant runtime impact, as the JVM would have to perform generic subtyping checks at runtime on every field or array element store, using the language’s generic type system. (This might sound easy and cheap when the type is something simple like List<String>, but can quickly get expensive when they are something like Map<? extends List<? super Foo>, ? super Set<? extends Bar>>. In fact, later research cast doubt on the decidability of generic subtyping: https://www.cis.upenn.edu/~bcpierce/papers/variance.pdf)

JVM ecosystem. Getting a dozen JVM vendors to agree on if, and how, type information would be reified at runtime was a highly questionable proposition.

Delivery pragmatics. Even if it were possible to get a dozen JVM vendors to agree on a scheme that could actually work, it would have greatly increased the complexity, timeline, and risk of an already substantial and risky effort.

Language ecosystem. Languages like Scala might not have been happy to have Java’s invariant generics burned into the semantics of the JVM. Agreeing on a set of acceptable cross-language semantics for generics in the JVM would again have increased the complexity, timetable, and risk of an already substantial and risky effort.

Users would have to deal with erasure (and therefore heap pollution) anyway. Even if type information could be preserved at runtime, there would always be dusty classfiles that were compiled before the class was generified, so there would still be the possibility that any given ArrayList in the heap had no type information attached, with the attendant risk of heap pollution.

Certain useful idioms would have been inexpressible. Existing generic code will occasionally resort to unchecked casts when it knows something about runtime types that the compiler does not, and there is no easy way to express it in the generic type system; many of these techniques would have been impossible with reified generics, meaning that they would have to have been expressed in a different, and often far more expensive, way.
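The "Runtime costs, bonus edition" point compares reified generics to the dynamic check Java already performs on array stores. As a small sketch of that existing check (the class and method names are hypothetical), storing an Integer into a String[] through an Object[] reference compiles but is rejected at runtime:

```java
// Illustrative sketch (hypothetical names): covariant arrays are checked on every store.
class CovariantArrayCheck {
    /** Attempts to store an Integer into a String[] viewed as Object[]. */
    static String storeIntoStringArray() {
        Object[] objects = new String[1]; // legal: String[] is a subtype of Object[]
        try {
            objects[0] = 1; // the JVM checks the element type on this store
            return "stored";
        } catch (ArrayStoreException e) {
            return "ArrayStoreException on store";
        }
    }

    public static void main(String[] args) {
        System.out.println(storeIntoStringArray()); // ArrayStoreException on store
    }
}
```

For arrays the check is a simple class comparison; the quoted passage argues that the equivalent check for wildcard-laden generic types would be far more expensive.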

Links and references

erasure / Brian Goetz / 2020.06
https://cr.openjdk.java.net/~briangoetz/valhalla/erasure.html
Two Ways to Bake Your Pizza - Translating Parameterised Types into Java
http://pizzacompiler.sourceforge.net/doc/pizza-translation.pdf
