- Obtain and preprocess the dataset: Obtain the MIMIC-III
dataset and extract the relevant information from it: the
note ID, chart date, note text, hospital expire flag, and all
ICD-9 diagnosis codes associated with the note via its
admission ID. You can use Python or any other programming
language for this step.
- Index the notes with Solr: Use Solr to index the notes using
the information extracted in step 1. Use the note ID
(noteevents.row_id) as the Solr document ID.
- Build a user interface: Develop a user interface that lets
users enter query conditions and returns the list of notes
that satisfy them. This can be a web UI or an interactive
command-line interface.
- Allow Lucene Query Syntax: Let users enter queries in
Lucene Query Syntax, so that they can search within one or
any combination of the fields listed in step 1(a) without
knowing the field names in the Lucene index.
- Use query expansion for synonyms: Implement a function that
expands queries with synonyms. Use the Consumer Health
Vocabulary (CHV) in UMLS to get all English synonyms of the
term entered by the user. Limit the synonyms to 30 terms for
better performance.
- Enable user control over query expansion: Let users choose
whether or not query expansion is applied to the final query
condition.
- Evaluate the system: Run the system against the query
conditions given in the question to evaluate its
performance.
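
The query expansion and field-agnostic search steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not a definitive implementation: the `synonyms` dictionary stands in for a real CHV lookup, the helper names (`quote_term`, `expand_term`, `build_solr_params`) and the field names are hypothetical, and using Solr's Extended DisMax parser (`defType=edismax` with `qf`) is one possible way to let users search several fields without naming them.

```python
from typing import Dict, List

MAX_SYNONYMS = 30  # cap on synonyms per term, as described in the steps


def quote_term(term: str) -> str:
    """Wrap a term in double quotes so Lucene treats it as a phrase."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def expand_term(
    term: str,
    synonyms: Dict[str, List[str]],
    use_expansion: bool = True,
    limit: int = MAX_SYNONYMS,
) -> str:
    """Return a Lucene clause for `term`, optionally OR-ed with its synonyms.

    `synonyms` is a hypothetical stand-in for a CHV lookup: it maps a
    lowercased term to its English synonyms.
    """
    if not use_expansion:
        return quote_term(term)

    seen = {term.lower()}
    extra: List[str] = []
    for candidate in synonyms.get(term.lower(), []):
        if len(extra) >= limit:
            break  # keep at most `limit` synonyms
        if candidate.lower() not in seen:  # skip case-insensitive duplicates
            seen.add(candidate.lower())
            extra.append(candidate)

    if not extra:
        return quote_term(term)
    return "(" + " OR ".join(quote_term(t) for t in [term] + extra) + ")"


def build_solr_params(query: str, fields: List[str]) -> Dict[str, str]:
    """Build request parameters that search `query` across all `fields`.

    Assumes the edismax parser, so the user does not need to type field
    names; the field names passed in are hypothetical.
    """
    return {"q": query, "defType": "edismax", "qf": " ".join(fields)}
```

For example, with a toy synonym table, `expand_term("heart attack", {"heart attack": ["myocardial infarction", "MI"]})` yields a single OR-ed clause, while passing `use_expansion=False` returns only the quoted original term. That flag is the switch the user interface would expose to turn expansion on or off.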
Requirements: As required in the attached file | .doc file | Python