Key point: irregular-text OCR is difficult because of the various shapes and distorted patterns of scene text.

Goal: rectify the image with the MORN so that the ASRN can read irregular text more easily.

Proposal: train the network in a weakly supervised way, and use the fractional pickup method to improve the sensitivity of the ASRN.

How the MORN works:

MORN learns and generates an offset-grid

The MORN first divides the image into several parts & then predicts the offset of each part.

With an input size of 32 * 100, the MORN divides the image into 3 * 11 = 33 parts.

All the offset values are activated by Tanh(), resulting in values within the range of (-1,1).

Bilinear interpolation then smoothly resizes the offset maps to 32 * 100.

Because every value in the offset maps represents the offset from the original position, we generate a basic grid from the input image to represent the original positions of the pixels.

The basic grid is generated by normalizing the coordinate of each pixel to [-1,1].

The basic grid and the resized offset maps are summed.

Before sampling, the x-coordinate and y-coordinate on the offset maps are normalized to [0,W] and [0,H], respectively.
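These steps can be expressed with a few tensor operations. Below is a minimal PyTorch sketch of my reading of the pipeline; the tiny offset CNN is a placeholder rather than the MORN backbone, and F.grid_sample is used for sampling, which expects coordinates in [-1, 1] instead of the [0, W] x [0, H] renormalization described above.

```python
# Minimal PyTorch sketch of the MORN rectification pipeline (a toy
# placeholder CNN predicts the offsets; the real backbone differs).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MORNSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Placeholder CNN: 1x32x100 image -> 2 coarse offset maps (x and y)
        # on the 3x11 grid of parts described above.
        self.offset_net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),  # -> 16x16x50
            nn.ReLU(),
            nn.Conv2d(16, 2, 3, stride=2, padding=1),  # -> 2x8x25
            nn.AdaptiveAvgPool2d((3, 11)),             # -> 2x3x11
        )

    def forward(self, img):                            # img: (B, 1, 32, 100)
        B, _, H, W = img.shape
        # Offsets squashed into (-1, 1) by Tanh.
        offsets = torch.tanh(self.offset_net(img))     # (B, 2, 3, 11)
        # Bilinearly resize the offset maps to 32x100.
        offsets = F.interpolate(offsets, size=(H, W),
                                mode='bilinear', align_corners=True)
        # Basic grid: original pixel positions normalized to [-1, 1].
        gy, gx = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing='ij')
        base = torch.stack((gx, gy), dim=-1).expand(B, H, W, 2)
        # Sum the basic grid and the resized offset maps, then sample.
        grid = base + offsets.permute(0, 2, 3, 1)      # (B, H, W, 2) as (x, y)
        # grid_sample wants [-1, 1] coordinates; the paper instead
        # renormalizes to [0, W] and [0, H] for its own sampler.
        return F.grid_sample(img, grid, align_corners=True)

rectified = MORNSketch()(torch.rand(2, 1, 32, 100))    # (2, 1, 32, 100)
```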

*advantages of the MORN

(1) The rectified images are more readable owing to the regular shape of the text and the reduced noise.

(2) The MORN is more flexible than the affine transformation.

(3) The MORN is more flexible than methods using a specific number of regressing points.

(4) The MORN does not require extra labelling information of character positions.

 

How the ASRN works:

The major structure of the ASRN is a CNN-BLSTM framework.

target sequence: (y_1, y_2, ..., y_N)

T: the largest number of steps that the decoder generates

The decoder stops processing when it predicts an end-of-sequence token 'EOS'.

output y_t:

y_t = Softmax(W_out · s_t + b_out)

s_t: the hidden state at time step t

s_t = GRU(y_prev, g_t, s_{t-1})

y_prev: the embedding vector of the previous output y_{t-1}

y_prev = Embedding(y_{t-1})

g_t: the glimpse vector

g_t = Σ_{i=1}^{L} α_{t,i} · h_i

g_t gathers the sequential feature vectors h_i according to the varying distribution of α_t, which is equivalent to a change in the attended feature area.

h_i: sequential feature vectors

L: length of the feature maps

α_{t,i}: the attention weight on h_i at step t (α_t is the vector of attention weights)

α_{t,i} = exp(e_{t,i}) / Σ_{j=1}^{L} exp(e_{t,j})

e_{t,i} = Tanh(W_s · s_{t-1} + W_h · h_i + b)

W_out, b_out, W_s, W_h, and b are trainable parameters.

In the testing phase, the ASRN uses only the predicted output of the previous step as y_{t-1}.
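One decoding step of the equations above can be sketched as follows. The layer sizes, the use of nn.GRUCell, and feeding the concatenation of y_prev and g_t as its input are my assumptions, not the authors' exact implementation.

```python
# Sketch of one attention-decoder step for the equations above.
import torch
import torch.nn as nn

class AttnDecoderStep(nn.Module):
    def __init__(self, n_classes, hid=256, feat=256, emb=128):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb)    # y_prev = Embedding(y_{t-1})
        self.W_s = nn.Linear(hid, hid, bias=False)   # W_s
        self.W_h = nn.Linear(feat, hid)              # W_h (its bias plays the role of b)
        self.score = nn.Linear(hid, 1, bias=False)   # reduces Tanh(...) to the scalar e_{t,i}
        self.gru = nn.GRUCell(emb + feat, hid)       # s_t = GRU(y_prev, g_t, s_{t-1})
        self.out = nn.Linear(hid, n_classes)         # W_out, b_out

    def forward(self, y_prev_id, s_prev, h):
        # y_prev_id: (B,) previous output ids; s_prev: (B, hid); h: (B, L, feat)
        e = self.score(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(h)))
        alpha = torch.softmax(e.squeeze(-1), dim=1)         # alpha_t: (B, L)
        g = (alpha.unsqueeze(-1) * h).sum(dim=1)            # g_t = sum_i alpha_{t,i} h_i
        s = self.gru(torch.cat([self.embed(y_prev_id), g], dim=1), s_prev)
        y = torch.log_softmax(self.out(s), dim=1)           # log of Softmax(W_out s_t + b_out)
        return y, s, alpha            # at test time, argmax(y) becomes the next y_{t-1}
```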

 

How Fractional Pickup works:

Learning the matching relationship between labels and target characters in images is a data-driven process.

Fractional pickup: fractionally pick up the neighboring features in the training phase.

An attention-based decoder trained with the fractional pickup method can perceive adjacent characters.

The wider field of attention contributes to the robustness of the MORAN.

A pair of neighboring attention weights is selected and modified at every time step:

α_{t,k} ← β · α_{t,k} + (1-β) · α_{t,k+1}

α_{t,k+1} ← (1-β) · α_{t,k} + β · α_{t,k+1}

β = rand(0,1): a random fraction

k = rand[1, T-1]: a randomly chosen position

The randomness of β and k avoids over-fitting and contributes to the robustness of the decoder.

T : maximum number of steps
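A hedged sketch of the update for one decoding step, assuming the attention weights arrive as a (B, L) tensor; the function name and shapes are mine, and k is drawn over neighboring positions of the weight vector.

```python
# Fractional pickup for one decoding step (variable names are mine).
import torch

def fractional_pickup(alpha: torch.Tensor) -> torch.Tensor:
    # alpha: (B, L) attention weights produced by the decoder at step t.
    B, L = alpha.shape
    if L < 2:
        return alpha
    beta = float(torch.rand(()))             # beta = rand(0, 1)
    k = int(torch.randint(0, L - 1, ()))     # k = rand[1, T-1] (0-based here)
    out = alpha.clone()
    # Mix the neighboring pair; note their sum is preserved.
    out[:, k] = beta * alpha[:, k] + (1 - beta) * alpha[:, k + 1]
    out[:, k + 1] = (1 - beta) * alpha[:, k] + beta * alpha[:, k + 1]
    return out
```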

 

*FP advantage - Variation of Distribution

Fractional pickup adds randomness to α_{t,k} and α_{t,k+1} in the decoder.

As a result, the distribution of α_t changes at every time step in the training phase.

 

*FP advantage - Shortcut of Forward Propagation

The sequential feature vector h_i is the output of the last bidirectional LSTM in the ASRN.

A shortcut connecting to step k is created by fractional pickup.

The shortcut retains some features of the previous step in the training phase, which acts as interference to the forget gate of the bidirectional LSTM.

Fractional pickup thus provides more information about the previous step and increases the robustness of the bidirectional LSTM in the ASRN.

 

*FP advantage - Broader Visual Field

Training with fractional pickup disturbs the decoder through the local variation of α_{t,k} and α_{t,k+1}.

(without fractional pickup) error term of the sequential feature vector h_k: δh_k = δg_t · α_{t,k}

(with fractional pickup) error term of the sequential feature vector h_k: δh_k = δg_t · (β · α_{t,k} + (1-β) · α_{t,k+1})

δg_t: the error term of the glimpse vector g_t

Without fractional pickup, δh_k depends only on α_{t,k}. Since α_{t,k+1} is computed from h_{k+1}, fractional pickup makes δh_k sensitive to the neighboring features.

Owing to this disturbance, back-propagated gradients are able to dynamically optimize the decoder over a broader range of neighbouring regions.

The MORAN trained with the fractional pickup method generates a smoother α_t at each time step. It extracts features not only of the target characters, but also of the foreground and background context. The extended visual field enables the MORAN to correctly predict target characters. This is the "first attempt to adopt a shortcut in the training of the attention mechanism."
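A tiny autograd check (my own illustration, not from the paper) that the gradient reaching h_k is indeed the mixed error term above:

```python
# Verify dh_k = dg_t * (beta * alpha_k + (1 - beta) * alpha_{k+1}).
import torch

L, d = 5, 8
h = torch.randn(L, d, requires_grad=True)       # sequential features h_i
alpha = torch.softmax(torch.randn(L), dim=0)    # attention weights (constants here)
beta, k = 0.7, 2
mixed = alpha.clone()
mixed[k] = beta * alpha[k] + (1 - beta) * alpha[k + 1]
mixed[k + 1] = (1 - beta) * alpha[k] + beta * alpha[k + 1]
g = (mixed.unsqueeze(1) * h).sum(0)             # glimpse g_t
g.sum().backward()                              # makes the error term of g_t all-ones
print(h.grad[k])                                # every entry equals mixed[k]
print(beta * alpha[k] + (1 - beta) * alpha[k + 1])
```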

 


MORAN : a Multi-Object Rectified Attention Network for scene text recognition

 

(1) MORAN = MORN + ASRN

MORN = Multi-Object Rectification Network (a kind of spatial transformer)

ASRN = Attention-based Sequence Recognition Network

 

MORN: the slanted text becomes more horizontal, tightly bounded, and easier to read

ASRN: outputs the predicted word

 

(2) fractional pickup method

With it, the MORAN can read rotated, scaled, and stretched characters in different scene texts.

 

(3)

The training of the MORN is guided by the ASRN, which requires only text labels.

(trained in a weakly supervised way)

It is free of geometric constraints and can rectify images with complicated distortion.

 

MORN learns and generates an offset-grid

(offset: the value added to a base address to form a second address, also called a "displacement"; expressing an address with an offset is known as relative addressing)

 

(4)

fractional pickup method to train the ASRN

By adopting several scales of stretch on different parts of the feature maps, the feature areas are changed randomly at every iteration in the training phase.

Owing to training with fractional pickup, the ASRN is more robust to the variation of context.

 

 

curriculum learning strategy for the training of the MORAN

 

ASRN = CNN-BLSTM framework followed by an attention decoder

 

Affine transformation networks are limited by certain geometric constraints.

 

Based on the predicted offsets, the image is rectified and becomes easier to recognize.

 


 

Curriculum Training

 

The MORN and ASRN can hinder each other during training.

A MORN cannot be guided to rectify images when the input images have already been correctly recognized by the high-performance ASRN.

curriculum learning strategy to guide each sub-network in MORAN

 

D = {(I_i, Y_i)}, i = 1...N: the training set

Loss = -Σ_{i=1}^{N} Σ_t log p(Y_{i,t} | I_i ; θ)

Y_{i,t}: the ground truth of the t-th character in I_i

θ: the parameters of the MORAN
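A minimal sketch of this loss, assuming the decoder emits per-step log-probabilities and the targets are padded to a fixed number of steps:

```python
# Summed negative log-likelihood over all ground-truth characters
# (batch/padding handling is an assumption).
import torch
import torch.nn.functional as F

def moran_loss(log_probs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # log_probs: (B, T, n_classes) per-step log p(. | I_i; theta)
    # targets:   (B, T) ground-truth character ids Y_{i,t}, including 'EOS'
    return F.nll_loss(log_probs.flatten(0, 1), targets.flatten(), reduction='sum')
```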

 

First Stage for ASRN

 

Second Stage for MORN

The ASRN trained using regular training samples is chosen to guide the MORN training.

 

Third Stage for End-to-end Optimization
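A hedged outline of the three stages; the train_epoch helper, the epoch counts, and the Adadelta optimizer are placeholder assumptions rather than the paper's exact settings.

```python
# Outline of the curriculum schedule with parameter freezing.
import itertools
import torch

def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def curriculum(morn, asrn, regular_loader, irregular_loader, train_epoch):
    # Stage 1: train the ASRN alone on regular (easy) samples.
    opt = torch.optim.Adadelta(asrn.parameters())
    for _ in range(10):
        train_epoch([asrn], regular_loader, opt)

    # Stage 2: freeze the ASRN; its recognition loss (text labels only)
    # guides the MORN, i.e. the weak supervision described earlier.
    set_trainable(asrn, False)
    opt = torch.optim.Adadelta(morn.parameters())
    for _ in range(10):
        train_epoch([morn, asrn], irregular_loader, opt)

    # Stage 3: unfreeze and jointly optimize MORN + ASRN end to end.
    set_trainable(asrn, True)
    opt = torch.optim.Adadelta(itertools.chain(morn.parameters(), asrn.parameters()))
    for _ in range(10):
        train_epoch([morn, asrn], irregular_loader, opt)
```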

 

MORAN is not designed for vertical text.

 
