Fixup Initialization: Residual Learning Without Normalization
02 Feb 2019, Prathyush SP

They manage to train very deep nets (10k layers!) without BatchNorm, by carefully rescaling the standard initialization and by initializing the second conv in each residual branch to zero.
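As a rough illustration of the idea, here is a minimal PyTorch sketch of a Fixup-style residual block, assuming two convs per residual branch (m = 2, so the paper's rescaling factor L^(-1/(2m-2)) becomes L^(-1/2)). The names `FixupBasicBlock` and `num_blocks` are hypothetical, chosen for this example; see the paper for the full recipe (including downsampling blocks).

```python
import torch
import torch.nn as nn


class FixupBasicBlock(nn.Module):
    """Sketch of a residual block trained without normalization (Fixup-style).

    Assumes two 3x3 convs per residual branch; `num_blocks` is the total
    number of residual branches L in the network (a hypothetical argument).
    """

    def __init__(self, channels: int, num_blocks: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

        # Fixup replaces BatchNorm's affine parameters with scalar biases
        # around each conv plus one scalar multiplier per branch.
        self.bias1a = nn.Parameter(torch.zeros(1))
        self.bias1b = nn.Parameter(torch.zeros(1))
        self.bias2a = nn.Parameter(torch.zeros(1))
        self.scale = nn.Parameter(torch.ones(1))
        self.bias2b = nn.Parameter(torch.zeros(1))

        # Careful init scaling: He init on the first conv, rescaled by
        # L^(-1/(2m-2)) = L^(-1/2) for m = 2 layers per branch...
        nn.init.kaiming_normal_(self.conv1.weight)
        with torch.no_grad():
            self.conv1.weight.mul_(num_blocks ** -0.5)
        # ...and the second (last) conv of the branch set to zero, so every
        # block computes the identity at initialization.
        nn.init.zeros_(self.conv2.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.conv1(x + self.bias1a) + self.bias1b)
        out = self.scale * self.conv2(out + self.bias2a) + self.bias2b
        return self.relu(out + x)
```

At initialization the residual branch outputs zero, so the whole network is the identity and gradients stay well-behaved regardless of depth; the rescaled first conv keeps per-step updates bounded as the number of blocks L grows.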
For more details, visit the source.