US 11,704,488 B2
Machine learned language modeling and identification
Vitor Rocha de Carvalho, San Diego, CA (US); Luis Carlos Dos Santos Marujo, Culver City, CA (US); and Leonardo Ribas Machado das Neves, Marina Del Rey, CA (US)
Assigned to Snap Inc., Santa Monica, CA (US)
Filed by Snap Inc., Santa Monica, CA (US)
Filed on Dec. 7, 2021, as Appl. No. 17/544,664.
Application 17/544,664 is a continuation of application No. 15/953,357, filed on Apr. 13, 2018, granted, now 11,210,467.
Claims priority of provisional application 62/485,357, filed on Apr. 13, 2017.
Prior Publication US 2022/0092261 A1, Mar. 24, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 40/263 (2020.01); G06F 40/284 (2020.01); G06F 40/295 (2020.01); G06N 20/00 (2019.01)
CPC G06F 40/263 (2020.01) [G06F 40/284 (2020.01); G06F 40/295 (2020.01); G06N 20/00 (2019.01)]
20 Claims
1. A method comprising:
accessing a set of messages comprising text and graphical text elements, the graphical text elements conveying an emotion;
identifying a first subset of messages from the set of messages, the first subset of messages comprising unsuitable messages;
identifying a second subset of messages from the set of messages, the second subset of messages comprising text representing an entity;
identifying a third subset of messages from the set of messages, the third subset of messages comprising superfluous characters;
removing the first subset of messages from the set of messages;
identifying text in the second subset of messages representing a named entity;
removing the identified text from the second subset of messages;
normalizing the superfluous characters within the third subset of messages;
incorporating the second and third subset of messages into a final set of messages;
converting the final set of messages into a set of training messages, the set of training messages configured for training a language detection machine-learning model;
training a classifier based on the set of training messages, the classifier having a set of features comprising a character ratio, wherein the character ratio is based on a number of non-American Standard Code for Information Interchange (ASCII) characters included in a message; and
generating a language detection model based on the trained classifier and the set of features.
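
For illustration only, the preprocessing steps and the character-ratio feature recited in claim 1 can be sketched as below. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, the repeated-character rule used to normalize superfluous characters, the literal string matching used to remove named entities, and the definition of the ratio as non-ASCII characters divided by message length are all hypothetical choices that the claim does not specify. The sketch also applies each step to every retained message rather than tracking the three subsets separately.

```python
import re

# Assumption: "superfluous characters" are modeled as any character repeated
# three or more times in a row, collapsed here to two occurrences.
_REPEATED = re.compile(r"(.)\1{2,}")


def normalize_superfluous(text: str) -> str:
    """Collapse runs of 3+ identical characters down to 2 (hypothetical rule)."""
    return _REPEATED.sub(r"\1\1", text)


def non_ascii_ratio(text: str) -> float:
    """Fraction of characters outside the 7-bit ASCII range (code point > 127).

    Assumption: the ratio is taken over the total message length.
    """
    if not text:
        return 0.0
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return non_ascii / len(text)


def prepare_messages(messages, is_unsuitable, named_entities):
    """Drop unsuitable messages, strip named-entity text, normalize the rest.

    `is_unsuitable` (a predicate) and `named_entities` (literal strings) are
    hypothetical stand-ins for whatever filters and entity recognizer are used.
    """
    final = []
    for msg in messages:
        if is_unsuitable(msg):
            continue  # remove the first subset (unsuitable messages)
        for entity in named_entities:
            msg = msg.replace(entity, "")  # remove text representing an entity
        msg = normalize_superfluous(msg)
        final.append(" ".join(msg.split()))  # tidy whitespace left by removals
    return final


if __name__ == "__main__":
    raw = ["spam", "Hi Alice!!!! olá"]
    cleaned = prepare_messages(raw, lambda m: m == "spam", ["Alice"])
    for message in cleaned:
        print(message, non_ascii_ratio(message))
```

In such a sketch, the value returned by `non_ascii_ratio` would be one entry in the feature vector passed to the classifier; the choice of classifier and of any further features is left open here.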