Systematic Review of Semantic Analysis Methods

NLP (natural language processing) is a broad subfield of computer science that focuses on the processing of human language by computers. Each NLP technique focuses on a different part of linguistics, with semantics being the main focus of this manuscript. This manuscript uses a systematic review of published papers ranging from 1998 to the current day to review and summarize critical analysis of the three main models on which this paper focuses: Latent Semantic Analysis (LSA), Explicit Semantic Analysis (ESA), and Neural Network based models. It compares these models against each other, weighs their advantages and disadvantages, and describes their uses. The selected studies in this review were analyzed and examined to ensure that they meet the quality standards of the proposed research methodology. The results show that Neural Network based solutions are the most popular Semantic Analysis model in academia (doubling the number of results of ESA and LSA combined), and they are usually the best performers in most tasks. However, there are specific scenarios and circumstances in which relying on the older LSA and ESA models could be beneficial.


Introduction
Natural language processing (NLP) is a broad field of computer science that deals with the interaction between computers and humans. It involves various techniques to analyze, understand, and generate human language (William et al., 2023). NLP involves several components, such as morphological, syntactic, semantic, and pragmatic analyses.
Morphological analysis studies words, their formation, and their relation to other words. Syntactic analysis involves analyzing the structure and grammar of sentences. Semantic analysis deals with understanding the meaning of words and sentences. Pragmatic analysis deals with interpreting language in context and how it conveys meaning. These different components of NLP work together to enable computers to understand and process human language (Yousif J., 2013). The complexity of the techniques used in NLP can range from finding the grammar of a sentence to summarizing the meaning and intent of a conversation between several people (Khurana et al., 2023).
Computational NLP has been around for a very long time, starting in the 1930s, when the first translation machines were patented.
Many technological advancements have been used to process text throughout the years, relying on concepts such as Finite State Machines (FSMs) (Yousif & Sembok, 2006a), Hidden Markov Models (HMMs) (Iftikhar et al., 2023; Yousif, J., 2019), and Neural Networks (Yousif & Sembok, 2006b). These concepts were used to create many different processing techniques focused on five essential components of linguistics (phonology, morphology, syntax, pragmatics, and semantics). However, the component that stands out from the rest is semantics, since it represents the meaning and intent of a piece of text. Meanings of text evolve from one time period to another, and text meanings can change entirely when evaluated at different levels of parsing: word, phrase, sentence, and paragraph. This means that evaluating semantics is essential for accurately processing text for purposes such as translation, information retrieval, text summarization, and even chatbots, which have exploded in popularity with the rise of language models such as ChatGPT (Kim et al., 2023).
It is no surprise that the semantic analysis software market and interest in semantic analysis in academia have both grown steadily over the last two decades. Figure 1 shows an annual growth rate of roughly +15% in the Sentiment Analysis Software Market, reaching 7 billion dollars of estimated revenue in 2031 (Business R., 2023). It can also be seen that interest in academia grew steadily from 2001 until 2021, when there was a significant decrease. Figure 2 presents the results of searching for the "Semantic Analysis" concept in Google Scholar, which shows that the concept appeared roughly 240,000 times in research papers in 2020. Identifying the research gaps related to semantic analysis therefore requires a critical review of the existing literature and an analysis of the field's current state. Possible research gaps include improving the accuracy and efficiency of the analysis and investigating the ethical implications of using semantic analysis in various contexts. In addition, a deeper understanding of context in natural language, resolving semantic ambiguities, and exploring multilingual semantic analysis are other major challenges. Improving entity recognition, understanding negation and uncertainty, and advancing semantic parsing for complex queries are also critical. Researchers should focus on interpretability, handling figurative language, and dynamic learning from user feedback. Cross-modal semantic analysis could provide a more holistic understanding of semantics.
This paper reviews and summarizes the critical analysis of three main models: Latent Semantic Analysis (LSA), Explicit Semantic Analysis (ESA), and Neural Network based models. It compares their advantages and disadvantages and describes their uses. The selected studies meet quality standards and are analyzed using a systematic methodology.

Literature Survey
Many studies discussed and implemented semantic analysis as a crucial factor in developing NLP applications.
Evangelopoulos et al. ran a review that introduces and explains the functionality of LSA (Latent Semantic Analysis) and its uses and weaknesses (Evangelopoulos, N., 2013). They found that LSA is a very flexible technique that has been used in various applications such as word meaning, memory, and speech coherence. However, it has a major flaw: its disregard for sentence-level individual document meaning resulting from word order. Gupta et al. conducted an in-depth review of the LSA method and its applications (Gupta et al., 2022). They found that it was very good for general topic modeling tasks employing a wide variety of information such as newspapers, press releases, tweets, and articles. However, it is very data-intensive and performs poorly when the archives are inadequate or incomplete. Shaik et al. reviewed semantic analysis methods in the educational field. They discussed many methods, including LSA, LDA (Latent Dirichlet Allocation), PLSA (Probabilistic LSA), and Neural Network based methods. Out of their many results, one particular result related to word embeddings stood out (Shaik et al., 2022): BERT (an NN-based method) has an advantage over methods like LSA because BERT produces word representations dynamically, as opposed to fixed representations that ignore context. This shows a massive advantage of neural-based techniques: their adaptability. Salloum et al. ran a survey to review the state of semantic analysis approaches. They found that approaches based on machine learning can sometimes be ambiguous and need to rely on other approaches, such as Latent Semantic Analysis and Explicit Semantic Analysis, to help alleviate these problems (Salloum et al., 2020). Gao et al. proposed a method to automatically annotate semantic information for online BIM (Building Information Modeling) product resources. They ended up using Latent Semantic Analysis for their document-level annotation method (Gao et al., 2017). However, due to its inability to generate word-level annotations, word-sense disambiguation based on context analysis was used alongside LSA to create a complete level of annotation. Gottron et al. reviewed the ESA (Explicit Semantic Analysis) method, explained how it works, and gave some insightful points regarding its uses, advantages, and disadvantages. They found that the method performs and classifies well with a large enough corpus of text (Gottron et al., 2011). However, it struggles more when using a corpus with a large variety of text, such as Wikipedia, which introduces much noise in the results. Mohamed (Mohammed et al., 2019) proposed a text summarization algorithm based on ESA and SRL (Semantic Role Labelling). To test their software, they compared it to other existing solutions, such as Microsoft Text Summarizer and System 19, which are based on term frequency calculations and the GISTEXTER architecture, respectively. They found that the ESA + SRL method was always among the top 3 best performing methods for SDS (Single Document Summarization) as well as MDS (Multiple Document Summarization), usually performing the best.
Haralambous (Haralambous et al., 2014) performed a study on the effect of using thematically enhanced context on the performance of ESA techniques. This process required them to assign TF-IDF values to each word in the text corpus. They also applied stemming and lemmatization to the text and limited the ESA so it reads only the French Wikipedia database. These steps curate the corpus of text that ESA will later process. They found that this helped the model improve its classification precision by 9 to 10%.
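To make the TF-IDF weighting concrete, the following is a minimal sketch in Python. The toy corpus and the helper name `tf_idf` are invented for illustration and are not taken from Haralambous et al.:

```python
import math

# Toy corpus of tokenized documents. TF-IDF weighs a word by its frequency
# in a document, discounted by how many documents contain it.
docs = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
    ["the", "eiffel", "tower", "is", "in", "paris"],
]
N = len(docs)

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)                 # term frequency
    df = sum(1 for d in docs if word in d)          # document frequency
    idf = math.log(N / df)                          # inverse document frequency
    return tf * idf

print(tf_idf("paris", docs[0]))  # distinctive word: positive weight
print(tf_idf("the", docs[0]))    # appears in every document: weight 0
```

A word such as "the" that occurs in every document receives a weight of zero, which is exactly why TF-IDF curation suppresses uninformative vocabulary before ESA processing.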
Moraes (Moraes et al., 2013) performed a comparison study of SVM (Support Vector Machine) and ANN (Artificial Neural Network) models. The models were tested on reviews of different sentiments (negative, positive, and neutral) across various contexts. They found that the ANN either outperformed or matched the SVM model, performing significantly better on movie reviews. They also found that the ANN has a much shorter running time than the SVM, at the cost of longer training times. Le (Le & Mikolov, 2014) proposed a paragraph vector neural network architecture for semantic analysis. They compared their model to typical BoW (bag of words) models and found that their architecture outperforms BoW and several other models, which exhibit weaknesses such as not considering the ordering of words. Zhai (Zhai et al., 2016) performed a study comparing their proposed autoencoder-based neural network model to others commonly used in the semantic analysis field. Some of the pitfalls of autoencoder-based models are that they perform worse on documents with an extensive vocabulary and do not scale well, so the authors introduced supervision via a loss function and a posterior probability distribution. Compared to BoW, DAE, FFNN, and LrDrop, their model usually performed equal to or better than the competition. Akila (Akila & Jayakumar et al., 2014) reviewed many semantic analysis methods and compared their advantages and disadvantages. Some reviewed methods include LSA, ESA, GLSA (Generalized LSA), and MMR (Maximum Marginal Relevance). They found that for corpus-based approaches, some advantages include no redundancy, robustness, and highly structured results. In contrast, disadvantages include difficulty of computation, susceptibility to noise, and the inability to measure the degree of similarity between two relations (for LSA). Table 1 presents a summary of the explored Semantic Analysis Models.
Semantic analysis is crucial in developing Natural Language Processing (NLP) applications. Studies have reviewed the functionality of Latent Semantic Analysis (LSA) and its uses and weaknesses. LSA is flexible, but its failure to capture sentence-level individual document meaning arising from word order is a major flaw. Approaches based on machine learning can sometimes be ambiguous and need to rely on other approaches, such as Latent Semantic Analysis and Explicit Semantic Analysis.

Research Methodology
This manuscript implements a systematic review to explore current semantic analysis techniques, their advantages, disadvantages, and applications. This systematic review of the literature focuses on research papers whose main topic of interest is Semantic Analysis, and its structure is based on the steps proposed by Kitchenham et al. for deploying a systematic review (Kitchenham & Charters, 2007). The published research papers were gathered from the Google Scholar database, and their quality was assessed based on how much information they provided concerning the above key factors (implementation, advantages, disadvantages, and uses). Once a paper of suitable quality is found, a summary of its key points is created by skimming through all the sections and extracting relevant information, focusing mainly on the implementation, results, and discussion sections. Ultimately, the manuscript weighs these techniques against each other and presents a comparison of the findings.

Search Criteria
The main Semantic Analysis techniques focused on in this manuscript are LSA (Latent Semantic Analysis), ESA (Explicit Semantic Analysis), and NN (Neural Network) based techniques. The following searches were run to collect relevant research papers for this systematic review:
• "Semantic Analysis" + "LSA" + "Latent"
• "Semantic Analysis" + "ESA" + "Explicit"
• "Semantic Analysis" + "Neural Network"
As for the eligibility criteria, the search was focused only on review articles, and all years after 1998 were considered. The process followed to gather the research papers is highlighted in Figure 3, which was built in accordance with the PRISMA guidelines (Moher et al., 2010).

Semantic Analysis Implementation
There are various methods to implement semantic analysis, including rule-based methods, statistical methods, and machine learning techniques. Rule-based methods use predefined rules to identify patterns in language and assign meaning to them. Statistical methods rely on probabilistic models to analyze language and identify patterns, such as LSA (Latent Semantic Analysis) and Explicit Semantic Analysis. Machine learning techniques involve training algorithms on large datasets of language to identify patterns and make predictions about meaning, such as Probabilistic Latent Semantic Analysis (PLSA) and BERT (an NN-based method).
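As a concrete illustration of the rule-based style, the sketch below maps surface patterns to coarse semantic tags. The rules, labels, and the `tag` helper are invented for illustration and are not drawn from any reviewed system:

```python
import re

# Hypothetical rule table: each rule pairs a surface pattern with a coarse
# semantic label. Real rule-based systems use far richer grammars.
RULES = [
    (re.compile(r"\b(buy|purchase|order)\b", re.I), "INTENT_PURCHASE"),
    (re.compile(r"\b(refund|return|cancel)\b", re.I), "INTENT_REFUND"),
]

def tag(sentence):
    """Return the semantic labels whose patterns match the sentence."""
    return [label for pattern, label in RULES if pattern.search(sentence)]

print(tag("I would like to buy a new phone"))  # -> ['INTENT_PURCHASE']
```

The appeal of this approach is transparency: every assigned meaning traces back to an explicit rule, though coverage is limited to whatever patterns the rule author anticipated.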

Latent Semantic Analysis (LSA)
LSA is an NLP technique that uses statistical methods to process a large corpus of text and group words that have similar or related meanings. It is an unsupervised method, meaning that it does not need any extra information, human knowledge, or any input other than the raw text. It essentially quantifies all the contexts in which a particular word can and cannot appear and, using that data, provides a set of constraints to determine the similarity of words and their relationships. It is relatively similar to Neural Network based solutions in the sense that it relies on mathematical values and hidden relationships between words that are not understandable to humans (Landauer et al., 1998). LSA uses a BoW (Bag of Words) model, wherein it converts a corpus of text into a term-document matrix. This allows it to extract the frequency of all the terms in the corpus, but it does not take the order of those words into account. This method operates under the assumption that related words, or words with similar meanings, will appear very close to each other in the text (Shaik et al., 2022). SVD (Singular Value Decomposition) takes the term-document matrix as input and produces 3 matrices, structured as in Figure 4. Using these matrices, LSA can find the hidden relationships between the words and group them into classes accordingly. The program has no concept of the actual meaning of text like humans do, so the classes are not labelled (one of the main disadvantages of LSA).
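The pipeline above (term-document matrix, SVD, truncation, similarity in the latent space) can be sketched in a few lines of Python with NumPy. The toy corpus, the choice of k = 2 latent dimensions, and the `cos` helper are all illustrative assumptions:

```python
import numpy as np

# Toy corpus: LSA starts from a term-document count matrix (BoW).
docs = [
    "cat sat on the mat",
    "dog sat on the log",
    "cats and dogs are pets",
]
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# SVD factors A into U (term-concept), S (concept strengths), Vt (concept-doc).
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Keeping only the top-k concepts is the truncation that makes them "latent".
k = 2
term_vecs = U[:, :k] * S[:k]  # each row: one vocabulary term in latent space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

i, j = vocab.index("cat"), vocab.index("dog")
print(cos(term_vecs[i], term_vecs[j]))  # similarity of "cat" and "dog"
```

Note that the resulting latent dimensions carry no human-readable labels, which matches the unlabelled-classes limitation described above.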

Explicit Semantic Analysis (ESA)
ESA is an NLP technique that also uses statistical methods to process large corpuses of text. However, it differs from LSA in a couple of ways. For starters, it is a supervised method. This means that, unlike LSA, it has access to other data and even human knowledge alongside the raw data it needs to process. For example, in the paper in which this method was proposed, Gabrilovich et al. used Wikipedia as the database for classification and scoring [17]. This also means that it does not have the same issue as LSA, where data is grouped purely through statistical analysis; ESA can provide labels for all the data since it has access to human knowledge. Another difference can be seen in the way it processes the text. It does not convert the text into a large matrix and then into smaller context matrices using SVD. Instead, it converts all the words in the corpus into concept (TF-IDF) vectors, each holding the word's corresponding concepts and their frequencies. The structure of this type of data processing can be seen in Figure 5. The higher the frequency, the higher the importance of that particular word to the text corpus. The similarity of two words can be computed by comparing their concept vectors using the cosine similarity, as determined in equation (1): sim(u, v) = (u · v) / (||u|| ||v||), where u and v are two concept vectors. The closer the value is to 1, the more similar the two words are. Having access to the frequency (TF-IDF) scores for all of the words in the corpus is crucial for tasks like text summarization, where we can aggregate the top X% of highest-scoring TF-IDF words and then group them together using another algorithm.
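Equation (1) can be sketched directly over sparse concept vectors. The two example vectors below, with their Wikipedia-style concept names and TF-IDF weights, are invented for illustration:

```python
import math

# Hypothetical concept vectors: each word maps concepts (e.g. Wikipedia
# articles) to TF-IDF weights. All values here are invented.
vec_bank  = {"Finance": 0.8, "River": 0.3, "Economy": 0.5}
vec_money = {"Finance": 0.9, "Economy": 0.6, "Currency": 0.7}

def cosine(u, v):
    """Cosine similarity of two sparse concept vectors, as in equation (1)."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

print(cosine(vec_bank, vec_money))  # closer to 1 means more similar
```

Because the dimensions are named concepts rather than latent factors, the overlap driving the score ("Finance" and "Economy" here) is directly inspectable, which is the interpretability advantage ESA holds over LSA.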

Neural Network based Solution
Neural Network based solutions have been around in NLP for a very long time (since the 1940s). They went through three phases, 1940-1970, 1980-1990, and 2000 to the present day, with research "shutting down" between phases for different reasons (the inability of early networks to learn the basic XOR function after the first phase, and insufficient data and processing power after the second). Most neural networks follow a similar design philosophy: they consist of small components (neurons) arranged in layers, and those layers are connected to form a neural network. This structure can be seen in Figure 6. Throughout these phases, several different neural network architectures were proposed (Yousif & Yousif, 2023).
• Feed Forward Neural Network (FFNN): Made of neurons, and data can only flow forwards, since the output of one neuron connects only to the input of a neuron in the next layer.
• Convolutional Neural Network (CNN): Similar to an FFNN but operates on data arranged in 2D space, in a way loosely analogous to how the brain processes spatial data.
• Recurrent Neural Network (RNN): Made of neurons, and data can flow backwards (recursion), meaning the network can have memory.
• LSTM (Long Short-Term Memory) based Neural Networks: Replaces the neurons in an RNN with units consisting of a forget, input, and output gate, allowing the network both to remember past data and to forget certain data.
• Encoder-Decoder based Neural Networks: Based on the LSTM architecture; uses context vectors to process data.
• Transformer based Neural Networks: Based on the Encoder-Decoder architecture, but stacks the encoders and decoders and makes them fully connected, so each unit can access all of its neighbours' states.
Currently, the state-of-the-art architecture in Neural Network technologies is the Transformer, and models such as ChatGPT and BERT rely on it. As for sentiment analysis specific applications, neural networks can operate on many different levels of lexical structure. If document-level semantic parsing is required, neural networks can use several different models, such as BoW (Bag of Words), Bag-of-N-grams (an extension of BoW), and word embeddings. These are then converted into dense document vectors, which contain both syntactic and semantic properties of the words (Xin et al., 2023). As for sentence-level parsing, due to sentences' small size (compared to documents), some outside processing such as parse trees, opinion lexicons, and part-of-speech tags is applied before the data is passed to the neural network. Some proposed architectures for sentence-level processing include Recursive Autoencoder Networks (RAE), the Matrix-Vector Recursive Neural Network (MV-RNN), and the Recursive Neural Tensor Network (RNTN) (Zhang et al., 2018).
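The simplest way to obtain a dense document vector from word embeddings, as described above, is to average them. The two-dimensional embedding values below are invented for illustration; real systems learn embeddings with hundreds of dimensions:

```python
# Toy word embeddings (values invented); real embeddings are learned.
emb = {
    "good":  [0.9, 0.1],
    "great": [0.8, 0.2],
    "bad":   [-0.7, 0.3],
}
DIM = 2

def doc_vector(text):
    """Average the embeddings of known words into a dense document vector."""
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    if not vecs:
        return [0.0] * DIM  # no known words: zero vector
    return [sum(component) / len(vecs) for component in zip(*vecs)]

print(doc_vector("good great movie"))
```

This averaging discards word order entirely, which is why sentence-level work turns to the recursive architectures (RAE, MV-RNN, RNTN) mentioned above, where composition follows the parse tree instead.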

Results & Discussion
This manuscript followed the systematic review structure proposed by Kitchenham (Kitchenham & Charters, 2007). Firstly, key factors such as the implementation, advantages, disadvantages, and uses of each semantic analysis model were identified for investigation during the data gathering phase. Google Scholar was the database used to identify all papers relevant to this systematic review that met the eligibility criteria. This paper examined a total of 15 papers based on the selection process in Figure 3.
This manuscript focused on 3 models for implementing semantic analysis: LSA, ESA, and Neural Network based solutions. However, there are many more methods that were only briefly mentioned, or included only within the reviewed papers themselves, such as LDA, LSI, PLSA, GLSA, SVM, MMR, and SRL. It was also found that the only models capable of operating at both the document and sentence level are Neural Network based models; LSA is based on the BoW model and ESA on the concept vector model, both of which operate at the document level.
We can also gauge the popularity and depth of research on each of the 3 main methods from the search result counts. The Neural Network based model search is at the top, with 2710 results. This is quite surprising, since neural network based solutions have only been viable in the last decade or so, compared to the other methods, which were established, used, and even considered the standard much earlier.
This shows that neural networks are an incredibly hot topic in the semantic analysis academic space, and due to their use in the most successful and mainstream language models, ChatGPT and BERT, this is the direction in which semantic analysis is heading. In second place is LSA, with 1350 results. This is not very surprising, since it was the earliest established of the three methods and was the standard at one point in history. In last place is ESA, with 115 results. In the papers where this method was researched, it was always presented as a supervised, improved version of LSA. However, the results suggest that it was a stopgap solution that was quickly replaced by Neural Network based models once those matured.
As for advantages, disadvantages, and uses: LSA is based on the BoW model, which means it carries the same advantages and disadvantages. Firstly, due to its flexibility, it can be used in many applications, including word meaning, memory, and speech coherence, and it is effective for general modeling of diverse information sources such as newspapers, press releases, and even tweets. However, due to its generality, it requires a lot of input data to function properly. Also, due to its BoW structure, it disregards sentence-level individual document meaning arising from word order. As for ESA, it is a supervised model (unlike LSA, which is unsupervised), meaning it has access to data and human knowledge outside of the raw input text. It performs incredibly well at Single Document Summarization and Multiple Document Summarization, due to its ability to rely on diverse databases like Wikipedia for labels and other input information. However, this large database access can also lead to its downfall: without extra preprocessing steps such as stemming and lemmatization, or limiting the model to a database of a specific language, diverse text corpora can introduce a lot of noise into the output.
Finally, Neural Network based models are currently the state of the art. Some of their main advantages are a much shorter running time after the initial training of the model and much greater adaptability compared to the fixed nature of LSA and ESA. They are also, overall, the best performing models in most of the tasks mentioned above (such as analyzing movie reviews). However, they have some downsides of their own. One of the main ones is that training the model initially is much more demanding in both time and computation. They also struggle somewhat with scalability and ambiguity (due to their more adaptable and dynamic nature), which means they need to rely on techniques such as supervision via a loss function and a posterior probability distribution.

Conclusion
This manuscript implements a systematic review based on the steps proposed by Kitchenham to find the implementation, advantages, disadvantages, and uses of LSA, ESA, and Neural Network based Semantic Analysis models. Google Scholar was the only database consulted for research papers used in this review. The data collection was performed as of December 2023, and only articles written in English, or having an English translation, were used as part of this work. Three search queries were used to filter the papers: ("Semantic Analysis" + "Neural Network"), ("Semantic Analysis" + "ESA" + "Explicit"), and ("Semantic Analysis" + "LSA" + "Latent"). These searches returned 2710, 115, and 1350 results, respectively. A list of key factors was also proposed to ensure the quality and relevance of the papers included in this systematic review. The results showed that the most popular Semantic Analysis model is the Neural Network based model, with 2710 results.
The results of this study show that for most scenarios and cases, the Neural Network based solutions usually perform the best.Due to their wide range of available models and architectures, as well as their adaptability and dynamic nature, there is usually a Neural Network model that is suitable for most cases.However, there are specific scenarios where older methods can shine.Depending on some circumstances such as limited computational resources, or the lack of need for some of the features that a Neural Network model can provide (like distance between similar meaning words for instance), it might be beneficial to consider using older models such as LSA and ESA.

Figure 3: Studies selected in accordance with eligibility criteria and key factors, based on PRISMA guidelines (Moher et al., 2010)

Figure 4: Structure of corpus of text after processing with LSA

Figure 5: Structure of data after it has been processed by ESA

Figure 6: Basic structure of a general Neural Network

Table 1: A summary of the Semantic Analysis Models