Post-OCR parsing: building simple and robust parser via BIO tagging

 

(Step 1) OCR: obtain the text segments and their coordinates in the image.

(Step 2) Serialization: use the coordinates to serialize the text segments into a single sequence.

(Step 3) BIO tagging: assign a BIO tag to each serialized segment.

(Step 4) Grouping: combine the tagged segments into groups.

 

(Step 1) "STAR" : [x=2, y=1], "HUGS" : [x=3, y=1]

(Step 2) "STAR", "HUGS", ... , "volcano", "x4", "@1024", "4048", "iced", "coffee", ...

(Step 3) field tags : store.nm_B, store.nm_I, ... , nm_B,  cnt_B, uprice_B, price_B, nm_I,  nm_I,  ...

         group tags : grp_B,      grp_I,      ... , grp_B, grp_I, grp_I,    grp_I,   grp_I, grp_I, ...

(Step 4) store.nm : "STAR HUGS COFFEE",

            ... ,

            menu : {

                        nm : "volcano iced coffee",

                        cnt : "4",

                        unitprice : "1024",

                        price : "4048"

                      },
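
The step from (Step 3) to (Step 4) in the example above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `decode` is hypothetical, the tokens hidden behind "..." are simply left out (so the store name comes out as "STAR HUGS" rather than "STAR HUGS COFFEE"), and treating dotted field names such as store.nm as top-level fields that ignore the group tag is an assumption made here.

```python
def decode(tokens, field_tags, group_tags):
    """Merge BIO-tagged tokens into top-level fields and grouped items.

    Assumption: fields with a dot (e.g. "store.nm") are top-level and
    ignore the group tag; all other fields belong to the current group.
    """
    top = {}
    groups = []
    current = None
    for token, ftag, gtag in zip(tokens, field_tags, group_tags):
        if ftag == "O":
            continue
        field, bio = ftag.rsplit("_", 1)  # "store.nm_B" -> ("store.nm", "B")
        if "." in field:
            target = top
        else:
            # grp_B opens a new group; grp_I continues the current one.
            if gtag == "grp_B" or current is None:
                current = {}
                groups.append(current)
            target = current
        if bio == "B" or field not in target:
            target[field] = token          # B starts a new value
        else:
            target[field] += " " + token   # I extends it, even after other fields

    return top, groups


tokens = ["STAR", "HUGS", "volcano", "x4", "@1024", "4048", "iced", "coffee"]
field_tags = ["store.nm_B", "store.nm_I", "nm_B", "cnt_B",
              "uprice_B", "price_B", "nm_I", "nm_I"]
group_tags = ["grp_B", "grp_I", "grp_B", "grp_I",
              "grp_I", "grp_I", "grp_I", "grp_I"]

top, groups = decode(tokens, field_tags, group_tags)
print(top)
print(groups)
```

Note that "iced" and "coffee" are attached to nm even though cnt, uprice and price tokens sit in between: within a group, an I tag extends the value of its own field. The values are still raw here ("x4", "@1024"); cleaning them up is left to the refinement step.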

 

(Step 1)

(Step 2)

(Step 3) token embedding (same as in BERT)

           segment embedding (same as in BERT)

           position embedding (same as in BERT)

           coordinate embedding: represents the normalized spatial information of visually embedded text segments

           line-group embedding: embeds the line number of each segment, which can be produced during the serialization process
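
As a rough sketch of where the inputs to the coordinate and line-group embeddings could come from, the code below serializes segments by their coordinates, assigns a line number along the way, and normalizes the coordinates by the image size. Everything here is an assumption for illustration: the function name `serialize`, the tolerance `line_tol`, the image size, and every coordinate except those of "STAR" and "HUGS" are made up, and the paper's actual serializer may work differently.

```python
def serialize(segments, width, height, line_tol=0.5):
    """Order (text, x, y) segments top-to-bottom, then left-to-right.

    Segments whose y is within line_tol of the first segment of the
    current line are put on the same line (hypothetical rule).
    """
    by_y = sorted(segments, key=lambda s: s[2])
    lines = []
    for seg in by_y:
        if lines and abs(seg[2] - lines[-1][0][2]) <= line_tol:
            lines[-1].append(seg)
        else:
            lines.append([seg])

    out = []
    for line_id, line in enumerate(lines):
        for text, x, y in sorted(line, key=lambda s: s[1]):
            out.append({
                "text": text,
                "x": x / width,    # normalized coordinate
                "y": y / height,   # normalized coordinate
                "line": line_id,   # input to a line-group embedding
            })
    return out


segments = [("HUGS", 3, 1), ("volcano", 1, 4), ("STAR", 2, 1), ("x4", 5, 4.2)]
for row in serialize(segments, width=10, height=10):
    print(row)
```

The normalized x, y values and the line number would then be looked up or projected into embeddings and combined with the BERT token, segment and position embeddings; that part is not shown.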

(Step 4) When extracting a given field, the B and I tags of other fields can be treated as O tags.

            In the receipt parsing task, there is an additional group tag that reflects the hierarchical structure of the parse (for example, fields such as name, count, and price are grouped together based on the item they represent).

(Step 5) The refinement process typically involves database matching and string conversion using regular expressions. In the receipt parsing task, (1) the various special symbols in "cnt" and "price" values and (2) the thousands separator in "price" are refined into a unified representation. In the name card parsing task, a person's first name and family name are distinguished using a name database, and the output formats for phone and fax numbers are unified.
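
The regular-expression part of the receipt refinement might look something like the sketch below. It is only an illustration under stated assumptions: the function name `refine_item` and the field list are hypothetical, applying the same rule to the unit price is an assumption, and dropping every non-digit character only makes sense if the values are whole numbers (a decimal point would be removed too).

```python
import re

# Hypothetical list of numeric fields; the notes only mention "cnt" and "price".
NUMERIC_FIELDS = ("cnt", "uprice", "price")


def refine_item(item):
    """Return a copy of item with numeric fields reduced to digits only.

    Removes special symbols such as "x" or "@" and thousands separators.
    Assumes whole-number values.
    """
    refined = dict(item)
    for field in NUMERIC_FIELDS:
        if field in refined:
            refined[field] = re.sub(r"[^0-9]", "", refined[field])
    return refined


raw = {"nm": "volcano iced coffee", "cnt": "x4", "uprice": "@1,024", "price": "4,048"}
print(refine_item(raw))
```

The database matching side (for example, splitting first name and family name with a name database) is not covered by this sketch.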

 

 

The baseline model uses a fine-tuned multilingual BERT.

 
