Since we want to implement full-text search in our application soon, we ran some tests with Tika as the text extractor and Lucene as the index provider. We know that there are two steps: building the index and then searching it. First the content of the document is converted to plain text, which is then passed through a tokenizer and filters. After that the index is built. Since we did not find everything in the documentation, we still have some questions about how this works in detail:
1. When is the text extraction triggered? I found something about the extraction in the post "How does index persistence work in ModeShape?", but it is unclear whether the extraction happens when a new binary property is added or when the query is executed. The post is also almost two years old, so it might not be up to date.
2. Is the extracted text stored anywhere, or is it only available temporarily?
3. How does the construction of the index work in detail?
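To make question 3 more concrete, this is roughly how we currently picture the tokenizing, filtering and index construction steps. It is only a simplified plain-Java sketch under our own assumptions: it does not use Tika, Lucene or ModeShape, the names IndexSketch, analyze, add, search and STOP_WORDS are made up for illustration, and the text extraction step is assumed to have already produced a plain String. The real implementation may well work differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Simplified stand-in for the pipeline we have in mind.
// Assumption: this is NOT how Lucene or ModeShape actually implement it.
class IndexSketch {
    // Hypothetical stop-word list used by the filtering step.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "is", "of", "the");

    // Inverted index: term -> ids of the documents that contain the term.
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Tokenizing and filtering: split on non-alphanumeric characters,
    // lower-case each token and drop stop words.
    static List<String> analyze(String plainText) {
        List<String> tokens = new ArrayList<>();
        for (String raw : plainText.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Index construction: record the document id under every remaining term.
    void add(String docId, String plainText) {
        for (String token : analyze(plainText)) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Searching: analyze the query the same way and look up its first term.
    Set<String> search(String term) {
        List<String> analyzed = analyze(term);
        if (analyzed.isEmpty()) {
            return Set.of();
        }
        return postings.getOrDefault(analyzed.get(0), Set.of());
    }

    public static void main(String[] args) {
        IndexSketch index = new IndexSketch();
        index.add("doc1", "The quick brown fox");
        index.add("doc2", "A quick introduction to Lucene");
        System.out.println(analyze("The quick brown fox")); // [quick, brown, fox]
        System.out.println(index.search("Quick"));          // [doc1, doc2]
        System.out.println(index.search("lucene"));         // [doc2]
    }
}
```

In this sketch the plain text is not kept after add() returns; only the terms and document ids survive. Whether the real extractor and index behave the same way is exactly what questions 2 and 3 are about.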
Thank you in advance for your response. Everything is working fine so far; we are just curious about the details.