version 0.12.1

This is a significant update with major improvements in data extraction. Version 0.12.1 adds several different methods of article detection and combines all of them to produce a clean and complete page.

  • Improved accuracy of article detection and extraction.
  • Added ability to process multi-page articles.
  • Added support for new binary formats: doc, xls, ppt, vsd, vst, zip, gz, tgz.
  • Added the ability to adjust extraction rules manually. This should allow almost instant fixes for reported problems.
  • Added multi-step analysis of extracted data, with automatic tuning if necessary.
  • Smarter detection of article titles.
  • Added initial content extraction from emailed tweets. So far it has been tested on just a few iOS clients. Your feedback will help to get it right.
  • Fixed an issue with broken formatting for some extracted pages.
  • Fixed an issue with partial articles for some links.
  • Improved processing of non-article/index pages, such as search results.
  • Fixed processing of links with spaces.
  • Fixed incorrect support link in ukeeper’s error email.
  • Fixed the missing error email when the response is too big to be sent.
  • Disabled support for multi-line links until I have a smart way to handle them without affecting plain single-line links that have some content below.
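
To illustrate the last two link-handling items (links with spaces, and single-line links with some content below), here is a minimal Python sketch. It is not ukeeper's actual code: the function name `extract_link` and the rule "the first non-empty line is the link" are assumptions made only for this example.

```python
from urllib.parse import quote


def extract_link(body: str) -> str:
    """Hypothetical helper: return the link found on the first non-empty line.

    Everything below that line is treated as ordinary content, which mirrors
    the single-line behaviour described in the notes above.
    """
    for line in body.splitlines():
        line = line.strip()
        if line:
            # Percent-encode spaces and other unsafe characters, but keep
            # URL delimiters and existing %-escapes untouched.
            return quote(line, safe=":/?#[]@!$&'()*+,;=%")
    return ""


link = extract_link("http://example.com/my page.html\nsome notes below")
print(link)
```

With this rule a link that wraps onto a second line would be cut short, which is the trade-off behind disabling multi-line links for now.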
