CPC G06F 40/263 (2020.01) [G06F 40/284 (2020.01); G06F 40/295 (2020.01); G06N 20/00 (2019.01)] | 20 Claims |
1. A method comprising:
accessing a set of messages comprising text and graphical text elements, the graphical text elements conveying an emotion;
identifying a first subset of messages from the set of messages, the first subset of messages comprising unsuitable messages;
identifying a second subset of messages from the set of messages, the second subset of messages comprising text representing an entity;
identifying a third subset of messages from the set of messages, the third subset of messages comprising superfluous characters;
removing the first subset of messages from the set of messages;
identifying text in the second subset of messages representing a named entity;
removing the identified text from the second subset of messages;
normalizing the superfluous characters within the third subset of messages;
incorporating the second and third subsets of messages into a final set of messages;
converting the final set of messages into a set of training messages, the set of training messages configured for training a language detection machine-learning model;
training a classifier based on the set of training messages, the classifier having a set of features comprising a character ratio, wherein the character ratio is based on a number of non-American Standard Code for Information Interchange (ASCII) characters included in a message; and
generating a language detection model based on the trained classifier and the set of features.
|
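The cleaning steps and the character-ratio feature recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The function names, the `is_unsuitable` predicate, the `named_entities` list, and the rule of collapsing runs of three or more repeated characters down to two are all assumptions made for illustration, since the claim does not specify how unsuitable messages or named entities are detected or how superfluous characters are normalized. For simplicity the sketch keeps every message that is not removed as unsuitable.

```python
import re

# Assumption: "superfluous characters" are treated here as any character
# repeated three or more times in a row; each run is collapsed to two.
_REPEATED_RUN = re.compile(r"(.)\1{2,}")


def non_ascii_ratio(message: str) -> float:
    """Share of characters in the message whose code point is outside ASCII."""
    if not message:
        return 0.0
    non_ascii = sum(1 for ch in message if ord(ch) > 127)
    return non_ascii / len(message)


def normalize_superfluous(message: str) -> str:
    """Collapse runs of three or more identical characters down to two."""
    return _REPEATED_RUN.sub(r"\1\1", message)


def prepare_messages(messages, is_unsuitable, named_entities):
    """Drop unsuitable messages, strip named entities, normalize repeats.

    is_unsuitable: hypothetical predicate standing in for the filter.
    named_entities: hypothetical list of entity strings to remove.
    """
    final = []
    for msg in messages:
        if is_unsuitable(msg):
            continue  # first subset: removed entirely
        for entity in named_entities:
            msg = msg.replace(entity, "")  # second subset: entity text removed
        msg = normalize_superfluous(msg)  # third subset: repeats normalized
        msg = " ".join(msg.split())  # tidy whitespace left by removals
        final.append(msg)
    return final


if __name__ == "__main__":
    cleaned = prepare_messages(
        ["hi Alice!!!!", "spam", "holaaaa"],
        is_unsuitable=lambda m: m == "spam",
        named_entities=["Alice"],
    )
    print(cleaned)
    print([non_ascii_ratio(m) for m in cleaned])
```

A ratio computed this way could be one column in the feature set passed to whatever classifier is trained on the prepared messages. The claim names the feature but not the classifier type, so none is shown here.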