I attended Dexy and Molly workshops and Ask the Experts discussions.
Ask the Experts: dealing with dirty data
Top tips
- Google Refine – CSV clean up; output into other formats
- AntiWord – Word formats to plain text converter
- FMT (formatting)
- Beautiful Soup (Python) – scraper
- ScraperWiki – remember this can be useful – can be used like a remote data store
- Python unicodedata.normalize – to format data into normal form C (NFC) – flatten accented characters
- Mozilla has tools to auto-detect character encoding
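
A minimal sketch of the unicodedata.normalize tip, using only the Python standard library. The helper names `to_nfc` and `strip_accents` are made up for illustration and were not part of the discussion. One thing worth noting: NFC on its own does not remove accents; it makes sure an accented character is stored as a single composed code point, so that two strings that look the same also compare equal. If "flatten" is meant as dropping the accents entirely, one common approach (an assumption here, not something the experts prescribed) is to decompose with NFD and discard the combining marks.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", text)


def strip_accents(text: str) -> str:
    """Hypothetical helper: decompose with NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


composed = "caf\u00e9"      # 'é' stored as one code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two strings render the same but differ code point by code point.
print(composed == decomposed)                  # False
# After NFC normalization they compare equal.
print(to_nfc(composed) == to_nfc(decomposed))  # True
# Dropping the combining marks leaves plain ASCII here.
print(strip_accents(composed))                 # cafe
```

Normalizing to NFC before comparing or de-duplicating values is usually the safer default; stripping accents loses information, so it is probably best kept for building search keys or similar.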