Archive for the ‘MySQL’ Category

Convert all MySQL tables and fields to UTF8 charset & collation

October 22nd, 2012

At some point in your web dev life you might face the “Illegal mix of collations” MySQL error. Nothing complicated: you just need all your tables to use the same charset and collation, and this can be fixed really easily using phpMyAdmin or even from the command line. However, when the database has 300+ tables, doing it manually is really not an option, so it needs to be done with some kind of script. It could be done in Perl, PHP, Python or anything for that matter, but I particularly liked this one-line command, using bash & awk (always loved awk):

mysql --user=username --password=your_pass --database=dbname -B -N -e "SHOW TABLES"  | awk '{print "ALTER TABLE", $1, "CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;"}' | mysql --user=username --password=your_pass --database=dbname &

Found it online and it worked like a charm.

Make sure to first read this article about converting charsets in MySQL. Executing the command above might break things, depending on the database architecture. It’s always a good idea to create a backup of the database first.
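A safe way to preview what would be executed is to drop the second mysql stage and only keep the awk part – here with two sample table names standing in for the real SHOW TABLES output:

```shell
# Sample table names standing in for the output of SHOW TABLES:
printf 'users\norders\n' |
awk '{print "ALTER TABLE", $1, "CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;"}'
# prints:
# ALTER TABLE users CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
# ALTER TABLE orders CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
```

Once the generated statements look right, pipe them into the second mysql call as in the one-liner above.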


When JOINs fail to work

May 6th, 2012

Any developer, junior or senior, should know and understand the basic rules of database normalization, since they are the foundation of relational DBMS. Edgar Codd introduced the concept in 1970 and since then, most architectures have been following his rules. However, for high-traffic websites, sometimes a de-normalized schema is better than a normalized one, resulting in faster queries, less swap usage and less storage space.

Although this post is not exactly about normalization, another common recommendation is to use a join rather than multiple single queries. In other words, rather than running “select * from a where condition1” and then “select * from b where condition2”, if a and b are in a relationship, it’s usually better to run a query like “select * from a inner join b on a.id = b.aid where conditions”. A classic example is using categories for a certain object, say movies. Let’s say each movie can have a category and you want to display a list of all movies and the category each movie is in. Some developers might write the following PHP code:

$rs = mysql_query('select * from movies');
while( $row = mysql_fetch_array($rs) ) {
  $rs2 = mysql_query('select * from categories where category_id = ' . $row['category_id']);
  $row2 = mysql_fetch_array($rs2);
  $category = $row2['category_name'];
  echo $row['name'] . "<br />Category: " . $category . "<br /><br />";
}

but there’s a simpler and (most of the time) better approach:

$rs = mysql_query('select * from movies m inner join categories c on c.category_id = m.category_id');
while( $row = mysql_fetch_array($rs) ) {
  echo $row['name'] . "<br />Category: " . $row['category_name'] . "<br /><br />";
}

The first approach generates 1 + N queries, where N is the number of movies. If N is a big number, things might take a while… The second approach only takes 1 query, leaving the hard part to the internal RDBMS engine. Well, sometimes, in certain scenarios, the first version is the right choice. It’s THE solution.

Let’s assume there are 4+1 tables – the main one, call it objects, one for tags, one for categories, another one called hits which will track the number of times a certain object has been seen by a user, and the last one for storing the n-to-n relationship between tags and objects, call it object_tags. The ERD is rather simple.

Also, let’s assume we want to show a list of objects, with all their tags, their current category and the number of hits it has. The results will be paginated, only showing 20 items per page. Simple, right? Anyone would tend to write something like this for the first page:

SELECT * FROM objects o 
INNER JOIN categories c ON c.category_id = o.category_id
INNER JOIN object_tags ot ON ot.object_id = o.object_id
INNER JOIN tags t ON t.tag_id = ot.tag_id
INNER JOIN hits h ON h.object_id = o.object_id
LIMIT 20

This was what I wrote too, more or less. However, after spending many hours trying to figure out why that page was loading in more than 100 seconds, whenever it was loading at all, I’ve learned that sometimes it’s better to break the rules and not follow any recommendations. So I tried the following code, and the results were staggering – less than 1s loading time:

$rs = mysql_query('select * from objects limit 20');
$categories = $tags = array(); //used for locally caching categories and tags
while( $row = mysql_fetch_array($rs) ) {
  //load the category name, unless it's already cached
  if( !isset($categories[$row['category_id']]) ) {
    $rsCat = mysql_query('SELECT category_name FROM categories WHERE category_id = ' . $row['category_id'] . ' LIMIT 1');
    $rowCat = mysql_fetch_array($rsCat);
    $categories[$row['category_id']] = $rowCat['category_name'];
    unset($rowCat, $rsCat);
  }

  //collect this object's tag ids; $toLoad holds the ones not cached yet
  $objectTags = $toLoad = array();
  $rsObjTags = mysql_query('SELECT tag_id FROM object_tags WHERE object_id = ' . $row['object_id']);
  while( $rowObjTags = mysql_fetch_array($rsObjTags) ) {
    if( isset($tags[$rowObjTags['tag_id']]) ) {
      $objectTags[$rowObjTags['tag_id']] = $tags[$rowObjTags['tag_id']];
    } else {
      $toLoad[] = $rowObjTags['tag_id'];
    }
  }
  unset($rsObjTags, $rowObjTags);

  //load all the missing tags in a single query
  if( count($toLoad) ) {
    $rsTags = mysql_query('SELECT tag_id, tag FROM tags WHERE tag_id IN (' . implode(',', $toLoad) . ') LIMIT ' . count($toLoad));
    while( $rowTags = mysql_fetch_array($rsTags) ) {
      $tags[$rowTags['tag_id']] = $rowTags['tag'];
      $objectTags[$rowTags['tag_id']] = $rowTags['tag'];
    }
    unset($toLoad, $rsTags, $rowTags);
  }

  $rsHits = mysql_query('SELECT today_hits FROM hits WHERE object_id = ' . $row['object_id'] . ' LIMIT 1');
  $hits = mysql_fetch_array($rsHits);

  //ready to display the data
  echo $row['object_name'] . ' is in category ' . $categories[$row['category_id']] . '<br />';
  echo 'Tags: ' . implode(', ', $objectTags) . '<br />';
  echo 'Hits today: ' . $hits['today_hits'] . '<hr />';
}

I hope there aren’t too many mistakes in the code; I wrote everything in here without actually testing it.

Seems like a lot of work and totally not optimized for something quite straightforward. However, let’s assume once more that the tables above have the following number of records:
objects – 50,000
categories – 500
tags – 100,000
objects_tags – 250,000
hits – 20,000,000

In this case, 4 joins would result in a huge dataset, requiring a lot of memory, and eventually ending up on the disk. Also, the LIMIT doesn’t help at all, since the dataset is truncated at the end, after scanning all the necessary rows. However, in the unoptimized version we do benefit from the LIMIT, by only getting a fixed set (20 rows) from 50k rows. The rest of the queries are acceptable because they are all primary key based, they are using “LIMIT” as well (3 out of 4), and because their number is rather small – somewhere between 45 and 80 queries.
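For what it’s worth, there is also a middle ground that stays in SQL – push the LIMIT into a derived table so only 20 objects ever enter the joins, and collapse the tag rows with GROUP_CONCAT. This is just a sketch (not what I ran back then, and untested against this schema):

```sql
SELECT o.object_name, c.category_name,
       GROUP_CONCAT(DISTINCT t.tag SEPARATOR ', ') AS tags,
       h.today_hits
FROM (SELECT * FROM objects LIMIT 20) o
INNER JOIN categories c ON c.category_id = o.category_id
INNER JOIN object_tags ot ON ot.object_id = o.object_id
INNER JOIN tags t ON t.tag_id = ot.tag_id
INNER JOIN hits h ON h.object_id = o.object_id
GROUP BY o.object_id;
```

Whether this beats the PHP version depends on the indexes and on the size of the hits table, so measure before picking one.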

Hope this helps someone some day. I’d like to know your thoughts – what would you have done? Any other quick fixes for this problem?


How to export customers from orders between certain dates in Magento

August 23rd, 2010

Last week a client who runs his store on Magento Commerce asked me how he could export a list of customers who purchased between certain dates in August. Of course, I first went to the Administration area and tried several approaches (order reports, customer reports, data export). Unfortunately nothing worked, so I had to write either a script or a plain SQL query and run it from the CLI. I preferred a straight query rather than a script, and I tried not to use sub-selects. Here’s the adapted part, where I’ve added some variables to keep all the attribute IDs:

SET @etID := (SELECT entity_type_id FROM eav_entity_type WHERE entity_type_code = 'order_address');
SET @atFn := (SELECT attribute_id FROM eav_attribute WHERE attribute_code = 'firstname' AND entity_type_id = @etID);
SET @atLn := (SELECT attribute_id FROM eav_attribute WHERE attribute_code = 'lastname' AND entity_type_id = @etID);
SET @atEmail := (SELECT attribute_id FROM eav_attribute WHERE attribute_code = 'customer_email');

SET @startDate := '2010-08-06 00:00:00';
SET @endDate := '2010-08-13 23:59:59';

SELECT o.increment_id, e.value as email, ln.value as lastname, fn.value as firstname
FROM sales_order o
INNER JOIN sales_order_varchar e ON e.entity_id = o.entity_id AND attribute_id = @atEmail
INNER JOIN sales_order_entity lne ON lne.parent_id = o.entity_id AND lne.entity_type_id = @etID
INNER JOIN sales_order_entity_varchar ln ON ln.entity_id = lne.entity_id AND ln.attribute_id = @atLn
INNER JOIN sales_order_entity_varchar fn ON fn.entity_id = lne.entity_id AND fn.attribute_id = @atFn
WHERE o.created_at BETWEEN @startDate AND @endDate
GROUP BY o.increment_id
ORDER BY o.increment_id DESC;

The only problem is when the Shipping Name is different from the Billing Name – there’s no control over that, as we’re grouping by increment_id. This could easily be fixed in a PHP script, but the approach there should be different (using collections).
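If the client needs the list as a CSV, the tab-separated output of mysql -B can be converted with a bit of awk. A sketch – the file names and the sample row below are made up:

```shell
# Hypothetical usage: save the statements above to export_customers.sql, then:
#   mysql -u username -p dbname -B < export_customers.sql > customers.tsv
# mysql -B prints tab-separated rows; quote each field and join with commas
# (note: this naive version doesn't escape double quotes inside fields):
printf '100000012\tjane@example.com\tDoe\tJane\n' |
awk 'BEGIN { FS = "\t"; OFS = "," } { for (i = 1; i <= NF; i++) $i = "\"" $i "\""; print }'
# prints: "100000012","jane@example.com","Doe","Jane"
```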

Hope this helps anyone, as it did the trick for me.

Solution tested under Magento

Magento, MySQL