Dev8D 2011 – day 2

I attended Dexy and Molly workshops and Ask the Experts discussions.

Ask the Experts: dealing with dirty data

Top tips

  • Google Refine – CSV clean up; output into other formats
  • AntiWord – Word formats to plain text converter
  • FMT (formatting)
  • Beautiful Soup (Python) – HTML/XML scraper
  • ScraperWiki – remember this can be useful – can be used like a remote data store
  • Python unicodedata.normalize – convert text to Normal Form C (NFC) so accented characters have a single, consistent representation
  • Mozilla has tools that auto-detect character encoding
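
As a note to self on the unicodedata.normalize tip, here is a minimal sketch using only the Python standard library. The helper names to_nfc and strip_accents are my own, not something from the session. As far as I understand it, NFC does not remove accents; it makes sure the same accented character is always stored the same way. Actually flattening accents (é to e) seems to need a decomposition step followed by dropping the combining marks, which is what the second helper attempts.

```python
import unicodedata


def to_nfc(text):
    """Return text in Normal Form C (composed characters where possible)."""
    return unicodedata.normalize("NFC", text)


def strip_accents(text):
    """Flatten accented characters by decomposing, then dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The same word stored two ways: a precomposed e-acute, and a plain "e"
# followed by a combining acute accent.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

print(composed == decomposed)                  # False: different code points
print(to_nfc(composed) == to_nfc(decomposed))  # True: NFC makes them identical
print(strip_accents(composed))                 # cafe
```

Normalising to NFC before comparing or de-duplicating records should stop visually identical strings from being treated as different values. Stripping accents loses information, so it is probably best kept for building search or sort keys rather than for the stored data itself.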