Geometric and Topological Approaches for Natural Language Processing
Most Natural Language Processing (NLP) methods are based on semantic, syntactic, or lexical concepts. However, these methods do not consider the structural aspects of key elements in language, such as the geometric structure of a sentence in the embedding vector space. In our research, we propose geometric and topological approaches that leverage these structural aspects for core NLP tasks.
We demonstrate the merits of this idea in cross-lingual settings. Concretely, we propose novel structure-based approaches for the generation and comparison of cross-lingual sentence representations. We do so by applying geometric and topological methods to analyze the structure of sentences, as captured by their word embeddings.
Our methods are based on geometric concepts, such as comparing the intra-distances between the words of a sentence, as well as on Topological Data Analysis (TDA). TDA is a collection of data-driven methods, mainly grounded in the mathematical field of algebraic topology. TDA methods provide simplified representations of complex data, have high noise tolerance, and are invariant to various types of continuous transformations.
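As a rough illustration of the geometric component, the following sketch compares two sentences of equal length via their intra-distance matrices. The embeddings, the Euclidean metric, and the Frobenius-norm comparison are illustrative assumptions, not necessarily the choices made in the actual work.

```python
import numpy as np

def intra_distance_matrix(embeddings):
    """Pairwise Euclidean distances between a sentence's word embeddings.

    embeddings: array of shape (n_words, dim).
    Returns an (n_words, n_words) symmetric matrix.
    """
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def structural_distance(emb_a, emb_b):
    """One plausible structure-based comparison of two equal-length
    sentences: Frobenius norm of the difference of their
    intra-distance matrices (hypothetical choice of metric)."""
    return np.linalg.norm(intra_distance_matrix(emb_a)
                          - intra_distance_matrix(emb_b))

# Toy usage with random "embeddings" for two 5-word sentences:
rng = np.random.default_rng(0)
sent_a = rng.normal(size=(5, 300))
sent_b = rng.normal(size=(5, 300))
print(structural_distance(sent_a, sent_b))
```

Note that the intra-distance matrix depends only on relative word positions, which is what makes this kind of representation a candidate for language-agnostic comparison.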
The key properties of our methods are:
(a) They are isometry-invariant, and thus provide nearly language-agnostic representations.
(b) They are fully unsupervised and use no cross-lingual signal.
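Property (a) can be checked numerically: a representation built from pairwise distances is unchanged under any isometry (rotation, reflection, translation) of the embedding space. The sketch below, using toy random embeddings, applies a random orthogonal map plus a translation and verifies that the distance matrix is preserved.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 50))  # 6 "words", 50-dim toy embeddings

# A random isometry: orthogonal matrix Q (from a QR decomposition)
# followed by a translation t.
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
t = rng.normal(size=50)
emb_iso = emb @ Q + t

def dists(X):
    """Pairwise Euclidean distance matrix of the rows of X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Distances are invariant: ||(x_i - x_j) Q|| = ||x_i - x_j||,
# and the translation cancels in the difference.
print(np.allclose(dists(emb), dists(emb_iso)))  # True
```

This invariance is what makes distance-based structural representations comparable across differently oriented embedding spaces, e.g. embeddings trained independently for different languages.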
We evaluate the quality of our representations, and their preservation across languages, on similarity comparison tasks, achieving competitive results. Furthermore, we show that our structure-based representations can be combined with existing methods for improved results.
* M.Sc. student under the supervision of Professor Omer Bobrowski.