Thanks for the comments. FYI, I will not be changing the .spml
files. These files will be available as a custom export. It will
take several seconds to create.
On 10/7/11 12:03 PM, Jonathan wrote:
> I don't remember why you want to use a string in the XML file
> for the signs.
Speed, portability, and simplicity. I just completed my proof of
concept for fuzzy searching. From a 3 MB file with over 10 thousand
signs, I can get accurate search results in less than 1 second. I
am using regular expressions to process ASCII characters. Next
week, I'll write about fuzzy searching, with the appropriate links
for the proof of concept.
> Wouldn't building everything out of XML be easier to work with?
Yes and no. Yes, because XML offers organization and portability.
No, because XML has a lot of overhead and gotchas. The libraries
take time to process text. Not all libraries work the same or
support the same feature set. I think XML is too robust for simple
> Many libraries can parse XML back to objects or save to a database
> to do calculations and searches on. My feeling is that XML and
> what's in it should be primarily for transporting data.
Can you show me an example of the type of XML you'd want to use for
an individual sign?
> In my personal opinion, information that is one piece in itself
> shouldn't be concatenated with other data and then need special
> parsing to get a specific part of it.
I can understand the logic and agree in part. For me, sign text
should be like regular text. This means spaces separate words. For
me, each word is a piece unto itself and should be concatenated
without spaces or punctuation because it is a unit.
> So I don't really like the 6 digits you are proposing below.
You can continue to use the preliminary Unicode strings if you
prefer. I've found that the ASCII version can be processed 4 times
faster or more. The ASCII regular expressions are always consistent,
but the Unicode version uses 3 different strings depending on the
encoding form: UTF-8, UTF-16, or UTF-32.
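As a small illustration of the encoding-form point, here is a sketch that encodes one character in each Unicode encoding form and prints the resulting code units. The code point U+1D800 is an arbitrary supplementary-plane character picked for this example, not one taken from the preliminary strings, and code_units is a hypothetical helper name.

```python
def code_units(text):
    """Return the hex bytes of `text` in each Unicode encoding form.

    Big-endian variants are used so that no byte order mark is added.
    """
    forms = ("utf-8", "utf-16-be", "utf-32-be")
    return {form: text.encode(form).hex() for form in forms}


# U+1D800 is an arbitrary example character (an assumption for illustration).
for form, units in code_units("\U0001D800").items():
    print(form, units)

# A plain ASCII character, for comparison.
print(code_units("x"))
```

A pattern written at the byte or code-unit level has to target whichever of these sequences the file actually contains, which is one way to read the "3 different strings" remark.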
> But if we are going to have to parse it, then at least make it easy
> to distinguish the parts. I think that if you are going to keep the
> string notation, then maybe the information should be enclosed
> within identifying symbols. Something like
Commas and parentheses add punctuation to the string, causing many
unusual side effects and increasing the possibility of a broken
for the coordinates (41,60), (-18,-18) and (11,-23)
I do agree with your point. The current coordinate notation is
sloppy. I've employed a simple fix. I add 500 to each value. This
means coordinates will always be 7 characters long: 3 for the X
value, 1 for the separating value, and 3 for the Y value.
The coordinate (41,60) becomes 541x560. The coordinate (-18,-18)
becomes 482x482. I was not planning to update the preliminary
Unicode version with the new coordinate strings unless someone
requested it. So for the .spml files, I'm not planning any changes.
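The offset scheme described above can be sketched in a few lines. This is only an illustration, not the SignPuddle code: the names OFFSET, encode_coord, and decode_coord are made up for the example, and the range check assumes the -500 to +499 axis limit mentioned elsewhere in this thread.

```python
import re

OFFSET = 500  # from the message: add 500 to each value

# Three digits, the separating "x", three digits: always 7 characters.
COORD_RE = re.compile(r"([0-9]{3})x([0-9]{3})")


def encode_coord(x, y):
    """Encode a signed (x, y) pair as a 7-character string such as 541x560."""
    for value in (x, y):
        # Outside this range the shifted value no longer fits in 3 digits.
        if not -500 <= value <= 499:
            raise ValueError(f"coordinate {value} is outside -500..499")
    return f"{x + OFFSET:03d}x{y + OFFSET:03d}"


def decode_coord(text):
    """Decode a 7-character coordinate string back to a signed (x, y) pair."""
    match = COORD_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a coordinate string: {text!r}")
    return int(match.group(1)) - OFFSET, int(match.group(2)) - OFFSET


print(encode_coord(41, 60))     # the first example from the message
print(encode_coord(-18, -18))   # the second example from the message
print(decode_coord("511x477"))
```

Because every encoded value is exactly 3 digits, the same fixed-width pattern matches any coordinate, with no sign or comma to special-case.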
> What about C for coordinate, then the X or Y value + 500 to get the
> Unicode point value. One Unicode character for X and one for Y?
Additional Unicode characters are not being considered right now
because there is no consensus on the higher level protocols of
SignWriting for Unicode. Instead of the coordinate style of
SignPuddle, they may choose a conceptual design based on deeper
But if the 2nd Unicode proposal did choose to go with coordinates, 1
or 2 rows of negative values and 1 or 2 rows of positive values
would be best.
As per your above preference, there is no reason to concatenate the
X and Y values into a single character, although a single character
for each point on a 2-dimensional grid of 256 by 256 does have a
> If you do go with what is below, I can make it work for my program.
> I don't have any issues with the new limited size of the axis,
> -500 to +499.
I'm glad you don't mind the size limitation. This is the biggest
change and it is mainly a validation issue.
> I am interested in your thoughts or comments on the above.
Thanks for the comments.