Wednesday, March 31, 2010

Busy Week!

Had some midterms for my graduate school courses coming up this week, so my hands have been a bit tied.

I'll be publishing an initial release of the python-moses wrapper I slapped together (with caveats) this weekend, after the Easter festivities.

I also plan to clean up svn and rename the googlecode project to something slightly more meaningful than "ccmts" (yet-another-acronym).

Another Great Idea

I had another idea that runs parallel to CrisisTerp. It's slightly more generic, and it could benefit not only CrisisTerp but also another related translation issue.

I plan to discuss the fruits of this work with one of my professors tomorrow, as the ramifications could lead to some interesting research in Machine Translation Evaluation.

ct


Wednesday, March 24, 2010

std::string vs. std::wstring: which C++ string datatype best supports UTF-8?

More bug resolution!

I managed to get the ht-en part of the site working. The issue was related to a somewhat confusing configuration management bug.

I modified the site so it handles user translation submissions in a more context-sensitive manner.

I also added an icon to the site, which gets rid of the CherryPy icon that was previously being rendered by default.

I'll be fiddling with the ht-en side of things over the next week until I can figure out that mess.

std::string vs. std::wstring

I found out there are issues with UTF-8 encoded strings crossing into the python-moses backend. I've been told C++ std::strings can handle Unicode, but since the STL has a special std::wstring data type for "wide strings", I'm led to believe otherwise. I even think SWIG supports std::wstring.

In any case, that just means a little more fiddling on my part with this aspect of the python-moses mojo.

Any tips on converting Python unicode strings into something C++ can process (std::string or std::wstring, that is the question) would be welcome!
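
One option worth trying, sketched below: encode the unicode string to UTF-8 bytes on the Python side before it crosses into C++. A std::string is just a container of bytes, so it can carry UTF-8 unchanged, though its length and indexing then count bytes rather than characters. The helper names here are hypothetical and not part of python-moses; how the bytes actually get handed to the wrapper is left out.

```python
# -*- coding: utf-8 -*-
# Sketch only: to_utf8_bytes / from_utf8_bytes are hypothetical helper names,
# not functions from python-moses or SWIG.

def to_utf8_bytes(text):
    """Encode a Python unicode string as UTF-8 bytes, ready for a byte-oriented C++ API."""
    return text.encode("utf-8")


def from_utf8_bytes(raw):
    """Decode UTF-8 bytes coming back from the C++ side into a unicode string."""
    return raw.decode("utf-8")


sample = u"M\u00e8si anpil"       # contains one non-ASCII character (e with grave accent)
encoded = to_utf8_bytes(sample)   # the accented letter takes two bytes in UTF-8
decoded = from_utf8_bytes(encoded)
```

The round trip is lossless, so the C++ code only has to avoid splitting the byte sequence in the middle of a multi-byte character.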

So far I've seen the following blurbs on this issue:


ct

Tuesday, March 23, 2010

Started Beta testing!

Start Your Translation Engines!

The CrisisTerp site instance is up and running on my slicehost space! After a good bit of debugging today, it seems to be working well.

Initial Bugs

Bug: Turns out the moses-python code I wrote was a complete debacle. The code was good for only 1 translation. After completing 1 translation, my moses-python wrapper would crash the entire site.

Solution: Python multiprocessing to the rescue! I was able to use multiprocessing to run a translation job, store the translation result, wipe the moses-python memory space, and re-instantiate that code for subsequent jobs. It's a very simple fix, though not a long-term one.
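
A minimal sketch of that pattern, not the actual site code: each job runs in its own child process, the result comes back over a queue, and the child's memory disappears when it exits. fake_translate is a stand-in for the real moses call, the function names are made up for illustration, and the "fork" start method assumes a Linux-style host and Python 3.

```python
import multiprocessing
import queue as queue_errors  # only needed for the Empty exception


def fake_translate(sentence):
    """Hypothetical stand-in for the real moses-python translation call."""
    return "[translated] " + sentence


def _worker(sentence, results):
    # Runs inside the child process; whatever state the decoder builds up
    # is discarded when this process exits.
    results.put(fake_translate(sentence))


def translate_in_fresh_process(sentence, timeout=30):
    """Run one translation job in a throwaway child process."""
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX host
    results = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(sentence, results))
    proc.start()
    try:
        # Read the result before joining so the child can flush its queue.
        return results.get(timeout=timeout)
    except queue_errors.Empty:
        return None  # the child crashed or took too long
    finally:
        proc.join(timeout=5)
        if proc.is_alive():
            proc.terminate()


if __name__ == "__main__":
    print(translate_in_fresh_process("bonjou"))
```

Starting a process per request is slow if the decoder has to reload its models every time, which is one reason this isn't a long-term fix.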

Bug: For some unknown sequences, moses generates an "|UNK" sub-string in the translation output. A simple enough fix with a regex.
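
Something along these lines does the job. This is a sketch: it assumes the marker is always the literal text "|UNK" glued to the end of a word, possibly repeated, and the sample sentences are invented.

```python
import re

# One or more literal "|UNK" markers in a row.
UNK_MARKER = re.compile(r"(?:\|UNK)+")


def strip_unk(translation):
    """Remove "|UNK" markers from decoder output, leaving the words themselves."""
    return UNK_MARKER.sub("", translation)


cleaned_single = strip_unk("mwen bezwen dlo|UNK")
cleaned_repeated = strip_unk("foo|UNK|UNK|UNK bar")
```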

Bug: Google Analytics. I wanted to track usage statistics and that sort of thing to see if it's worth maintaining a site (slicehost ain't free). I had some issues with the JavaScript Google gave me, but I managed to work around them. Google Analytics is now tracking CrisisTerp! On that note, I can't say enough positive things about CherryPy. Web development isn't my forte, but CherryPy has made all the difference!

Thought: If the site generates enough traffic, then I'll probably consider monetizing parts of it (AdSense on the site or on this blog) to reduce the cost of running the site. If the traffic load is heavy enough, I may need to consider purchasing more space from slicehost. In the event I choose to go this route, I'll make the dollar amounts (costs, etc.) publicly available. If the income ends up exceeding the costs, then I'll donate the surplus to the Red Cross. Seems fitting.

Next Step?

I need to package the site code, my parallel corpora, and the moses-python wrapper for deployment on the ccmts google code page OR on a new google code page. I'm tempted to fire up a new page...then again, that's a lot more work than it's worth.

Future Projects

I have a lot of ideas for where to take this system.
  • I've some additional languages I'd like to target.
  • Expect to see more language support added in the future, plus web service integration features (e.g. a registration form for a UUID so I can better track bandwidth consumption)
  • Integration with social networking sites!
That about wraps it up for now. I'm off to Boston for PAX this Friday, so I'll be out of the loop. Let it be known, I'll be working toward the packaging of corpora and source code next week!
ct