Fixup Initialization: Residual Learning Without Normalization

They manage to train very deep residual nets (up to 10,000 layers!) without BatchNorm. The recipe: initialize normally, but rescale the weight layers inside each residual branch by L^(-1/(2m-2)) (L = number of residual branches, m = layers per branch), initialize the last conv of each branch (and the classification layer) to zero so every block starts as the identity, and add scalar biases plus a per-branch scalar multiplier in place of BatchNorm's affine parameters.
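
A minimal PyTorch sketch of the idea, assuming a basic 2-conv residual block; the class name `FixupBasicBlock`, the `num_blocks` argument, and the exact bias placement are illustrative, not the paper's reference code:

```python
import torch
import torch.nn as nn


class FixupBasicBlock(nn.Module):
    """Residual block with no normalization, initialized Fixup-style (sketch).

    num_blocks is L, the total number of residual branches in the network.
    """

    def __init__(self, channels, num_blocks):
        super().__init__()
        # Scalar biases (init 0) and a scalar multiplier (init 1) stand in
        # for BatchNorm's affine parameters.
        self.bias1 = nn.Parameter(torch.zeros(1))
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bias2 = nn.Parameter(torch.zeros(1))
        self.bias3 = nn.Parameter(torch.zeros(1))
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.scale = nn.Parameter(torch.ones(1))
        self.bias4 = nn.Parameter(torch.zeros(1))
        self.relu = nn.ReLU(inplace=True)

        # Fixup init: standard He init for the first conv, rescaled by
        # L^(-1/(2m-2)); with m = 2 layers per branch this is L^(-1/2).
        nn.init.kaiming_normal_(self.conv1.weight, mode="fan_out", nonlinearity="relu")
        with torch.no_grad():
            self.conv1.weight.mul_(num_blocks ** (-0.5))
        # The last conv of the branch starts at zero, so the whole block is
        # the identity function at initialization.
        nn.init.zeros_(self.conv2.weight)

    def forward(self, x):
        out = self.conv1(x + self.bias1)
        out = self.relu(out + self.bias2)
        out = self.conv2(out + self.bias3)
        return self.relu(x + self.scale * out + self.bias4)


# Tiny usage check (shapes only): one block of a hypothetical 50-block net.
block = FixupBasicBlock(channels=64, num_blocks=50)
y = block(torch.randn(2, 64, 32, 32))
```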

For more details, visit the source.