Diffstat (limited to 'libibex/TODO')
-rw-r--r-- libibex/TODO 61
1 files changed, 61 insertions, 0 deletions
diff --git a/libibex/TODO b/libibex/TODO
new file mode 100644
index 0000000000..a087c8d1f3
--- /dev/null
+++ b/libibex/TODO
@@ -0,0 +1,61 @@
+Stability
+---------
+* ibex_open should never crash, and should never return NULL without
+errno being set. Should check for errors when reading.
+
+
+Performance
+-----------
+* Profiling, keep thinking about data structures, etc.
+
+* Check memory usage
+
+* See if writing the "inverse image" of long ref streams helps
+compression without hurting performance now. (i.e., if a word appears in
+more than half of the files, write out the list of files it _doesn't_
+appear in). (I tried this before, and it wasn't working well, but the
+file format and data structures have changed a lot.)
+
+* We could save a noticeable chunk of time if normalize_word computed
+the hash of the word and then we could pass that into
+g_hash_table_insert somehow.
+
+* Make a copy of the buffer to be indexed (or provide an interface for
+the caller to say ibex can munge the provided data) and then use that
+rather than constantly copying things?
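
Note on the "inverse image" item above: the idea can be sketched in a few lines of C. This is only an illustration under stated assumptions, not ibex's actual file format: the function name `encode_refs`, the use of plain `unsigned` file indexes, and the sorted-input requirement are all hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical sketch of the "inverse image" idea (not ibex code).
 *
 * refs:     sorted, duplicate-free file indexes in [0, nfiles)
 * out:      caller-provided buffer with room for nfiles entries
 * inverted: set to 1 if `out` lists the files the word does NOT
 *           appear in, 0 if it is a plain copy of `refs`
 *
 * Returns the number of entries written to `out`.
 */
size_t encode_refs(const unsigned *refs, size_t nrefs, unsigned nfiles,
                   unsigned *out, int *inverted)
{
    size_t n = 0, r = 0;
    unsigned f;

    /* Word appears in at most half of the files: store the list as-is. */
    if (nrefs * 2 <= nfiles) {
        for (n = 0; n < nrefs; n++)
            out[n] = refs[n];
        *inverted = 0;
        return nrefs;
    }

    /* Otherwise store the complement, which is the shorter list. */
    *inverted = 1;
    for (f = 0; f < nfiles; f++) {
        if (r < nrefs && refs[r] == f)
            r++;            /* file contains the word: skip it */
        else
            out[n++] = f;   /* file lacks the word: record it */
    }
    return n;
}
```

A reader of the stream would need the `inverted` flag stored alongside the list, and every lookup on an inverted list costs a pass over the file range, which is presumably where the performance concern comes from.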
+
+
+Functionality
+-------------
+* ibex file locking
+
+* specify file mode in ibex_open
+
+* ibex_find* need to normalize the search words... should this be done
+by the caller or by ibex_find?
+
+* There needs to be some way to do a secondary search after getting
+results back from ibex_find* (e.g., for "foo near bar"). This either
+has to be done by ibex, or requires us to export the normalize
+interface.
+
+* Does there need to be an ibex_find_any, or is that easy enough for the
+caller to do?
+
+* utf8_trans needs to cover at least two more code pages. This is
+tricky because it's not clear whether some of the letters there should
+be translated to ASCII or left as UTF-8. This requires some
+investigation.
+
+* ibex_index_* need to ignore HTML tags.
+ NAME = [A-Za-z][A-Za-z0-9.-]*
+ </?{NAME}(\s*{NAME}(\s*=\s*({NAME}|"[^"]*"|'[^']*'))?)*\s*>
+ <!(--([^-]|-[^-])*--\s*)*>
+
+ ugh. ok, simplifying, we get:
+ <[^!]([^"'>]*("[^"]*"|'[^']*'))*[^"'>]*> or
+ <!(--([^-]|-[^-])*--\s*)*>
+
+ which is still not simple. sigh.
+
+* ibex_index_* need to recognize and ignore "non-text". Particularly
+BinHex and uuencoding.
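
Note on the HTML-tag item above: instead of the regular expressions, a small hand-written scanner can do the same job. The sketch below is an assumption-laden illustration, not ibex code: `strip_html_tags` is a hypothetical helper, and it treats a comment as simply `<!--` up to the next `-->`, which is looser than the SGML comment grammar in the regex. It replaces each tag or comment with one space so that words on either side do not get glued together, and it honors quoted attribute values so a `>` inside quotes does not end the tag.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of ibex): copy `in` to `out`, replacing
 * each HTML tag or comment with a single space. `out` must have room for
 * strlen(in) + 1 bytes. Returns the number of bytes written, excluding
 * the terminating NUL.
 */
size_t strip_html_tags(const char *in, char *out)
{
    const char *p = in;
    char *o = out;

    while (*p) {
        /* Only treat '<' as a tag start if a letter, '/' or '!' follows. */
        if (*p != '<' ||
            !(isalpha((unsigned char)p[1]) || p[1] == '/' || p[1] == '!')) {
            *o++ = *p++;
            continue;
        }

        if (strncmp(p, "<!--", 4) == 0) {
            /* Comment: skip to the closing "-->", or to end of input. */
            const char *end = strstr(p + 4, "-->");
            p = end ? end + 3 : p + strlen(p);
        } else {
            /* Tag: skip to '>', ignoring any '>' inside quotes. */
            char quote = 0;
            p++;
            while (*p && (quote || *p != '>')) {
                if (quote) {
                    if (*p == quote)
                        quote = 0;
                } else if (*p == '"' || *p == '\'') {
                    quote = *p;
                }
                p++;
            }
            if (*p == '>')
                p++;
        }
        *o++ = ' ';     /* keep a word boundary where the tag was */
    }
    *o = '\0';
    return (size_t)(o - out);
}
```

An unterminated tag or comment swallows the rest of the buffer here; whether that is acceptable for indexing mail is a judgment call the TODO item leaves open.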