Replacing smart quotes, em-dashes, and ellipses with MySQL or PHP


Alternate title: "Help!  My Quotes Appear as Question Marks or Other Strange Characters!"

The "Smart quotes" feature in Microsoft Office transforms straight quotes into curly quotes.  It also transforms hyphens into em-dashes and three periods into ellipses.  While one might think, "How lovely!  My document looks almost as if I'm educated!" readers of said document may not.  Microsoft, in its infinite wisdom, decided to assign special characters such as the ones I just mentioned to a range of codes above 128.  Problem: these codes were already assigned to other characters, resulting in frustrating incompatibility with non-Microsoft systems.

Keep reading for some PHP and MySQL code to help out with this issue, as well as a Joomla! plugin.


Our introduction to this was in a situation where we had people using many different systems submitting articles to one of our programs.  We decided that we wanted all our articles to use straight quotes, hyphens, and periods.  This was partly for consistency, and partly because these characters are common to many character sets and won't cause incompatibilities.

This article isn't really intended as a complete explanation of charsets, though we recommend The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets if you're interested in learning more.  We will however show you some MySQL and PHP techniques for replacing all instances of smart quotes, plus the en dash, em dash, and ellipsis with straight quotes, one or two dashes, or three dots.  This code should operate with both the Windows-1252 charset, and also UTF-8, an encoding with an extended character set that has made it the preferred encoding for email and websites.

MySQL:
# FIRST, REPLACE UTF-8 characters.
UPDATE `t` SET `c` = REPLACE(`c`, 0xE28098, "'");
UPDATE `t` SET `c` = REPLACE(`c`, 0xE28099, "'");
UPDATE `t` SET `c` = REPLACE(`c`, 0xE2809C, '"');
UPDATE `t` SET `c` = REPLACE(`c`, 0xE2809D, '"');
UPDATE `t` SET `c` = REPLACE(`c`, 0xE28093, '-');
UPDATE `t` SET `c` = REPLACE(`c`, 0xE28094, '--');
UPDATE `t` SET `c` = REPLACE(`c`, 0xE280A6, '...');
# NEXT, REPLACE their Windows-1252 equivalents.
UPDATE `t` SET `c` = REPLACE(`c`, CHAR(145), "'");
UPDATE `t` SET `c` = REPLACE(`c`, CHAR(146), "'");
UPDATE `t` SET `c` = REPLACE(`c`, CHAR(147), '"');
UPDATE `t` SET `c` = REPLACE(`c`, CHAR(148), '"');
UPDATE `t` SET `c` = REPLACE(`c`, CHAR(150), '-');
UPDATE `t` SET `c` = REPLACE(`c`, CHAR(151), '--');
UPDATE `t` SET `c` = REPLACE(`c`, CHAR(133), '...');
PHP:
// First, replace UTF-8 characters.
$text = str_replace(
 array("\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", "\xe2\x80\x93", "\xe2\x80\x94", "\xe2\x80\xa6"),
 array("'", "'", '"', '"', '-', '--', '...'),
 $text);
// Next, replace their Windows-1252 equivalents.
 $text = str_replace(
 array(chr(145), chr(146), chr(147), chr(148), chr(150), chr(151), chr(133)),
 array("'", "'", '"', '"', '-', '--', '...'),
 $text);
Joomla!:
Joomla! plugin for removing smart/curly quotes, em-dashes and ellipses.


Additionally, here's a table of character codes that you may find useful:
Character HTML Code Windows UTF-8 name
‘ 145 E28098 left single curly quote
’ 146 E28099 right single curly quote
“ 147 E2809C left double curly quote
” 148 E2809D right double curly quote
– 150 E28093 en dash
— 151 E28094 em dash
… 133 E280A6 ellipsis

Further reading from Wikipedia:
UTF-8

ISO/IEC 8859-1

Windows-1252
 
  1. Rod
    May 4th, 2009 at 12:08 | #1

    THANK YOU!!!!!!!

    I spent 2 hours trying to figure this out.  I was querying the jos_content table outside of Joomla, and it was a mess - as you can imagine.

    VERY much appreciated - Rod
     

  2. May 6th, 2009 at 17:43 | #2

    Thanks Rod!  Glad to hear this helped you out :)
     

  3. July 17th, 2009 at 14:01 | #3

    I also spent about 2 hours trying to figure this out. You're awesome Mango!
     

  4. Nirav Shah
    July 17th, 2009 at 14:36 | #4

    YOU ROCK Mango!
     

  5. September 28th, 2009 at 23:08 | #5

    Thanks a LOT dude!!!! I was about to give up on trying to replace ellipses with triple dots till I came across your post. You be a life saver :)

    - Ananth
     

  6. November 29th, 2009 at 16:05 | #6

    very useful - and easily extended to include \x99 and other goofiness. Many thanks.
     

  7. Jeremy
    December 1st, 2009 at 16:44 | #7

    You are a god among men. I've spent all day trying to fix these dumb smart quotes and nowhere else on the web could I find your bit where you replace the following characters. This solved it. I owe you everything.

    "\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", "\xe2\x80\x93", "\xe2\x80\x94", "\xe2\x80\xa6"
     

  8. December 6th, 2009 at 18:46 | #8

    You guys be careful - you're boosting my already-inflated ego! :)

    I am glad to hear this is helping.
     

  9. January 25th, 2010 at 13:53 | #9

    love it, works fine for body but does it fix the MS word characters in the titles? It is only fixing the body of the articles for me and not the title? Any way to fix this??? If so, I will post positive comment in JED and our user forum at 4RSS.

    4RSS
     

  10. February 7th, 2010 at 16:57 | #10

    Latest version adds this feature.
     

  11. February 23rd, 2010 at 16:59 | #11

    Dude, thanks. I was about to tear out one of my three remaining hairs. This was simple, rocks and lets me get on with life.
     

  12. February 23rd, 2010 at 17:13 | #12

    If Microsoft was preventing you from getting on with your life I am very glad to have been able to help with that!
     

  13. February 23rd, 2010 at 17:14 | #13

    Welll... I was referring to my programming life... :)
     

  14. theJoker
    August 18th, 2010 at 08:44 | #14

    Thanks for this, I spent a long, long time trying to convert those stupid windows curly quotes. I really appreciate it.
     

  15. cykan
    September 15th, 2010 at 07:30 | #15

    hey thx man. i had some probs too, especially because the ellipsis. ^^ Thx!
     

  16. Daniel
    September 22nd, 2010 at 14:10 | #16

    By the way, this conversion operates faster on large tables with a WHERE clause (PostgreSQL):

    UPDATE note
    SET note_text = REPLACE(note_text, E'\…', E'\.\.\.')
    WHERE note_text LIKE E'%\…%';
     

  17. Jesse
    March 27th, 2011 at 07:08 | #17

    u r a CHAMP! Thank you sir!
     

  18. Dan Metzner
    March 29th, 2011 at 17:45 | #18

    Thank you! I have to accommodate numerous people entering product information from various irritating sources including MS Word. This is such a simple lifesaver!
     

  19. Kosta Kontos
    June 3rd, 2011 at 06:56 | #19

    This is awesome! Works like a charm.
    Thanks so much :)
    Cheers from South Africa
     

  20. Eirik O.
    August 11th, 2011 at 03:31 | #20

    Thank you for this info!

    One note..
    I got an error in parsing the Ellipsis when using chr(133) in php.
    (It changed all occurrences of norwegian "Å")
    Php ord() gave 226 as the ellipsis and that worked well.

    Best Regards
    Eirik
     

  21. kbc
    October 25th, 2011 at 03:56 | #21

    Great solution, thanks for sharing
     

  22. gige
    November 11th, 2011 at 11:04 | #22

    Thank you mate, great smart/curly quotes, em-dashes and ellipses reference!
     

  23. Brandon
    December 1st, 2011 at 22:56 | #23

    Dude, you're AWESOME. This totally worked, couldn't find this info anywhere else.
     

  24. Mayur
    December 7th, 2011 at 23:03 | #24

    We spend about 2 hrs figuring the problem. This worked like a charm.. Very good solution
     

  25. December 15th, 2011 at 06:51 | #25

    Thank you! Customer has been driving me mad with this.
     

  26. January 29th, 2012 at 08:33 | #26

    Thank you!, quickly fixed a knotty problem.
     

Allowed HTML: <b>, <i>, <em>, <strong>. All other < and > will be replaced with &lt; and &gt;.