Diffstat (limited to 'libibex/TODO')
-rw-r--r-- | libibex/TODO | 61 |
1 files changed, 61 insertions, 0 deletions
diff --git a/libibex/TODO b/libibex/TODO
new file mode 100644
index 0000000000..a087c8d1f3
--- /dev/null
+++ b/libibex/TODO
@@ -0,0 +1,61 @@
+Stability
+---------
+* ibex_open should never crash, and should never return NULL without
+errno being set. Should check for errors when reading.
+
+
+Performance
+-----------
+* Profiling, keep thinking about data structures, etc.
+
+* Check memory usage
+
+* See if writing the "inverse image" of long ref streams helps
+compression without hurting performance now. (ie, if a word appears in
+more than half of the files, write out the list of files it _doesn't_
+appear in). (I tried this before, and it wasn't working well, but the
+file format and data structures have changed a lot.)
+
+* We could save a noticeable chunk of time if normalize_word computed
+the hash of the word and then we could pass that into
+g_hash_table_insert somehow.
+
+* Make a copy of the buffer to be indexed (or provide interface for
+caller to say ibex can munge the provided data) and then use that
+rather than constantly copying things. ?
+
+
+Functionality
+-------------
+* ibex file locking
+
+* specify file mode in ibex_open
+
+* ibex_find* need to normalize the search words... should this be done
+by the caller or by ibex_find?
+
+* Needs to be some way to do a secondary search after getting results
+back from ibex_find* (ie, for "foo near bar"). This either has to be
+done by ibex, or requires us to export the normalize interface.
+
+* Does there need to be an ibex_find_any, or is that easy enough for the
+caller to do?
+
+* utf8_trans needs to cover at least two more code pages. This is
+tricky because it's not clear whether some of the letters there should
+be translated to ASCII or left as UTF8. This requires some
+investigation.
+
+* ibex_index_* need to ignore HTML tags.
+ NAME = [A-Za-z][A-Za-z0-9.-]*
+ </?{NAME}(\s*{NAME}(\s*=\s*({NAME}|"[^"]*"|'[^']*')))*>
+ <!(--([^-]*|-[^-])--\s*)*>
+
+ ugh. ok, simplifying, we get:
+ <[^!](([^"'>]*("[^"]*"|'[^']*'))*> or
+ <!(--([^-]*|-[^-])--\s*)*>
+
+ which is still not simple. sigh.
+
+* ibex_index_* need to recognize and ignore "non-text". Particularly
+BinHex and uuencoding.
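
Note on the "inverse image" item under Performance: the idea can be sketched in a few lines of C. This is only an illustration under stated assumptions, not libibex code. The `ref_header` type, the `encode_refs` function, and the use of sorted `size_t` file ids in the range 0..nfiles-1 are all hypothetical; the TODO does not say how ibex actually stores its ref streams.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical header for one word's encoded reference list. */
typedef struct {
    int inverted;   /* 1 if the ids list files the word does NOT appear in */
    size_t count;   /* number of ids written to the output buffer */
} ref_header;

/*
 * refs:   sorted, unique file ids (0..nfiles-1) that contain the word
 * nrefs:  number of entries in refs
 * nfiles: total number of indexed files
 * out:    caller-provided buffer with room for at least nfiles ids
 *
 * Writes whichever list is shorter: the references themselves, or
 * their complement when the word appears in more than half the files.
 */
ref_header encode_refs(const size_t *refs, size_t nrefs,
                       size_t nfiles, size_t *out)
{
    ref_header h;
    size_t i, j = 0, n = 0;

    assert(out != NULL && nrefs <= nfiles);

    if (nrefs * 2 <= nfiles) {
        /* At most half of the files: store the list as-is. */
        for (i = 0; i < nrefs; i++)
            out[i] = refs[i];
        h.inverted = 0;
        h.count = nrefs;
        return h;
    }

    /* More than half of the files: walk all ids, keep the missing ones. */
    for (i = 0; i < nfiles; i++) {
        if (j < nrefs && refs[j] == i)
            j++;            /* file i contains the word, skip it */
        else
            out[n++] = i;   /* file i does not contain the word */
    }
    h.inverted = 1;
    h.count = n;
    return h;
}
```

A reader of such a stream would check `inverted` and, when it is set, rebuild the real list by taking the complement again. Whether the smaller stream is worth that extra pass on lookup is the question the TODO item leaves open.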