Each segment can be said to have four logical parts: postings, stored fields, the deletions file, and the
term vectors data.
Storedfields
The stored fields are organized into two files.
• [seg_name].fdx - Field inDeX - pointers to field data
• [seg_name].fdt - Field DaTa - the actual stored fields
When a document turns up as a hit in a search and must be retrieved, KinoSearch1 looks at the Field inDeX
file to see where in the data file the document's stored fields start, then retrieves all of them from
the .fdt file in one lump.
_1.fdx--|
|--[doc#0 => 0]----->_1.fdt--|
| |--[bodytext]
| |--[title]
| |--[url]
|--[doc#1 => 305]----->_1.fdt--| # byte 305
| |--[bodytext]
| |--[title]
| |--[url]
|--[...]--------------->_1.fdt--|--[...]
If a field is marked as "vectorized", its "term vectors" are also stored in the .fdx file.
Postings
"Posting" is a technical term from the field of Information Retrieval which refers to an single instance
of a one term indexing one document. If you are looking at the index in the back of a book, and you see
that "freedom" is referenced on pages 8, 86, and 240, that would be three postings, which taken together
form a "posting list". The same terminology applies to an index in electronic form.
The postings data is spread out over 4 main files (not including field normalization data, which we'll
get to in a moment). From lowest to highest in the hierarchy, they are...
[seg_name].prx - PRoXimity data. A list of the positions at which terms appear in any given document.
The .prx file is just a raw stream of VInts; the document numbers and terms are implicitly indicated by
files higher up the hierarchy.
[seg_name].frq - FReQuency data for terms. If a term has a frequency of 5 in a given document, that
implies that there will be 5 entries in the .prx file. The terms themselves are implicitly specified by
the .tis file.
_1.frq--|
|--[doc#40 => 2]----->_1.prx--|--[54,107]
|--[doc#0 => 1]----->_1.prx--|--[6]
|--[doc#6 => 1]----->_1.prx--|--[504]
|--[doc#36 => 3]----->_1.prx--|--[2,33,747]
|--[...]------------->_1.frq--|--[...]
[seg_name].tis - TermInfoS. Among the items stored here is the term's doc_freq, which is the number of
documents the term appears in. If a term has a doc_freq of 22 in a given collection, that implies that
there will be 22 corresponding entries in the .frq file. Terms are ordered lexically, first by field,
then by term text.
_1.tis--|
|--[...]----------------------->_1.frq--|--[...]
|--[bodytext:mule => 1]-->_1.frq--|--[doc#40 => 2]
|--[bodytext:multitude => 3]-->_1.frq--|--[doc#0 => 1]
| |--[doc#6 => 1]
| |--[doc#36 => 3]
|--[bodytext:navigate => 1]-->_1.frq--|--[doc#21 => 1]
|--[...]----------------------->_1.frq--|--[...]
|--[title:amendment => 27]-->_1.frq--|--[doc#21 => 1]
| |--[doc#22 => 1]
|--[...]----------------------->_1.frq--|--[...]
[seg_name].tii - TermInfos Index. This file, which is decompressed and loaded into RAM as soon as the
IndexReader is initialized, contains a small subset of the .tis data, with pointers to locations in the
.tis file. It is used to locate the right general vicinity in the .tis file as quickly as possible.
_1.tii--|
|--[bodytext:a => 20]---------->_1.tis--|--[bodytext:a] # byte 20
| |--[bodytext:about]
| |--[bodytext:absolute]
| |--[...]
|--[bodytext:mule => 27065]---->_1.tis--|--[bodytext:mule]
| |--[bodytext:multitude]
| |--[...]
|--[title:amendment => 56992]-->_1.tis--|--[title:amendment]
|--[...]
Here's a simplified version of how a search for "freedom" against a given segment plays out:
1. The searcher asks the .tii file, "Do you know anything about 'freedom'?" The .tii file replies,
"Can't say for sure, but if the .tis file does, 'freedom' is probably somewhere around byte 21008".
2. The .tis file tells the searcher "Yes, we have 2 documents which contain 'freedom'. You'll find them
in the .frq file starting at byte 66991."
3. The .frq file says "document number 40 has 1 'freedom', and document 44 has 8. If you need to know
more, like if any 'freedom' is part of the phrase 'freedom of speech', take a look at the .prx file
starting at..."
4. If the searcher is only looking for 'freedom' in isolation, that's where it stops. It already knows
enough to assign the documents scores against "freedom", with the 8-freedom document scoring higher
than the single-freedom document.
Deletions
When a document is "deleted" from a segment, it is not actually purged from the postings data and the
stored fields data right away; it is merely marked as "deleted", via the .del file. The .del file
contains a bit vector with one bit for each document in the segment; if bit #254 is set then document 254
is deleted, and if it turns up in a search it will be masked out.
It is only when a segment's contents are rewritten to a new segment during the segment-merging process
that deleted documents truly go away.
FieldNormalizationFiles
For the sake of simplicity, the example search scenario above omits the role played the field
normalization files, or "fieldnorms" for short. These files have the (theoretical) suffix of .f followed
by an integer -- .f0, .f1, etc. Each segment contains one such file for every indexed field.
By default, the fieldnorms' job is to make sure that a field which is 100 terms long and contains 10
mentions of the word 'freedom' scores higher than a field which also contains 10 mentions of the word
'freedom', but is 1000 terms in length. The idea is that the higher the density of the desired term, the
more relevant the document.
The fieldnorms files contain one byte per document per indexed field, and all of them must be loaded into
RAM before a search can be executed.