1. #1
    TravisVOX
    Join Date: 12-25-12
    Posts: 30
    Betpoints: 861

    Alternatives to PHP + MySQL

    I wonder if anyone here could provide me with some guidance.

    I've been working on a robust database for horse race handicapping for over a year now, using PHP and MySQL as the core technologies to run it. I've stored it on my local computer, on Amazon servers, on hosting servers, etc., all of which ultimately work, but are slow.

    I've been privy to some other setups that are more locally based, using software such as Visual Basic, FoxPro, etc., which are SIGNIFICANTLY faster. I'm pretty sure I've optimized my setup as much as possible; long story short, I would like to explore some language/database options that are faster than the PHP + MySQL combo.

    Any inputs/experiences/advice?

    Right now, simulating or analyzing, say, 100,000 events can take a long time: longer than it should, and longer than I've experienced with other people's setups.

    FWIW - I own a Mac, but don't necessarily have to use it for my database/model... could use my Windows PC.

  2. #2
    Richards
    Join Date: 10-20-10
    Posts: 386
    Betpoints: 198

    You might find other languages/databases, but depending on what you mean by slow, your time would probably be more efficiently spent (learning how to) aggressively optimizing your existing database than learning another language and rewriting in it.

    If it were me, this is what I would do (apologies if you've done any of this already; I know you said you've optimized, but you didn't provide detail):

    * Remember the 80/20 rule. 80% of the time taken is going to be in 20% or less of your code.
    * Figure out where the bottleneck is. Is the query in MySQL running like lightning, and PHP is taking forever, or is the PHP script finishing in 0.25 seconds and the query is taking forever? My guess is the latter. If you're not really careful about how you write queries they will be slow no matter what DB you are using.
    * If it's MySQL, use some of the built-in query profiling tools, or maybe even a 3rd-party tool, to see exactly what is taking so long and why. This is a very common problem, so there are lots of tools out there.
    * If it's PHP, check the docs for inserting some profiling code into your script so you can see where you're spending all your time (there's a rough sketch of both checks after this list).
    * Once you find out where the problem area is, see if you can rewrite it for a speedup.
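
    As a rough illustration of both checks (sketch only; the connection details, table, and column names below are made up, so adjust them to your schema):

    Code:
    <?php
    // Sketch only: connection details, table and column names are made up.
    $pdo = new PDO('mysql:host=localhost;dbname=handicapping', 'user', 'pass');

    $sql = "SELECT horse_id, finish_pos FROM results WHERE race_date > '2012-01-01'";

    // 1) Time the query and fetch from PHP's side.
    $start = microtime(true);
    $rows  = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
    printf("query + fetch: %.3f seconds for %d rows\n", microtime(true) - $start, count($rows));

    // 2) Ask MySQL how it plans to run the same query ("ALL" in the type column means a full table scan).
    foreach ($pdo->query("EXPLAIN " . $sql) as $explainRow) {
        print_r($explainRow);
    }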

    I bet there's a query or three that are taking up all the time.

    If you get that far post here and I or others can probably help speed it up.

    Only after I'd addressed this "low hanging fruit" would I even consider trying to take on porting to another platform.

    If you rewrite in another language, well, at least you'll learn something new. Good luck!
    Last edited by Richards; 04-05-13 at 01:37 PM. Reason: spelling correction

  3. #3
    FreeFall
    Join Date: 02-20-08
    Posts: 3,365
    Betpoints: 2391

    You've written a vague post so it's hard to specifically help you. I'll give you a broad answer.

    Why not throw hardware at the problem? It's pretty cheap...

  4. #4
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    Are your tables indexed? It would do wonders for you, performance-wise, to add a few more column indexes.
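
    A rough example (MySQL syntax; the table and column names are made up, so index whatever your WHERE and JOIN clauses actually filter on):

    Code:
    -- Sketch only: made-up table and columns.
    ALTER TABLE results ADD INDEX idx_horse_date (horse_id, race_date);

    -- Then confirm the query actually uses the index:
    EXPLAIN SELECT * FROM results WHERE horse_id = 123 ORDER BY race_date DESC;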

  5. #5
    Blax0r
    Join Date: 10-13-10
    Posts: 688
    Betpoints: 1512

    Quote Originally Posted by Maverick22 View Post
    Are your tables indexed? It would do wonders for you, performance-wise, to add a few more column indexes.
    This!

  6. #6
    TravisVOX
    Join Date: 12-25-12
    Posts: 30
    Betpoints: 861

    Thanks for the feedback. I am going through and looking at the queries and indexes to see where I can optimize more. I gave up on the silly idea of moving to a new language.

    I think what is most challenging is the sheer depth of horse racing data. I've become frustrated by how long it takes to update one column which I use as a core coefficient in my model.

  7. #7
    339955
    Join Date: 07-20-12
    Posts: 198

    maverick, could you explain your idea a bit more please?

    travis, you are updating DB columns each time you try to model a result? that seems pretty slow; can you do it another way?

  8. #8
    TravisVOX
    Join Date: 12-25-12
    Posts: 30
    Betpoints: 861

    I have a table for the ultimate output of the model. For the first version of the model, I have four columns with the key variables in the formulas.

    I'm trying to avoid looping through each and every line to update, and keep the calculation within MySQL. If someone could offer up a better strategy for storing model output, I'm all ears, as getting this right is obviously important.

  9. #9
    yak merchant
    Join Date: 11-04-10
    Posts: 109
    Betpoints: 6170

    Quote Originally Posted by TravisVOX View Post
    I have a table for the ultimate output of the model. For the first version of the model, I have four columns with the key variables in the formulas.

    I'm trying to avoid looping through each and every line to update, and keep the calculation within MySQL. If someone could offer up a better strategy for storing model output, I'm all ears, as getting this right is obviously important.
    Well, I'm a SQL Server guy, so I might be barking up the wrong tree, but first and foremost: are you using temporary (in-memory) tables anywhere? Each write to the actual table is probably the slowest part, even if you have optimized indexes. So a lot of the time I insert the model results into a temporary table and then update the main table all at once at the conclusion of the processing. That being said, your comment to "keep the calculation in MySQL" works up until a point, and then that becomes a bottleneck. I'm still trying to get "calculation logic" out of my database so I can parallelize/distribute the computing. Anyway, good luck, and just wait until you start trying to store large datasets with the results of multiple models and back test across multiple years; the database optimization game gets even more fun....
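
    Something like this rough sketch (MySQL syntax; the table and column names are made up):

    Code:
    -- Stage the model results in an in-memory temp table, then write to the real table once.
    CREATE TEMPORARY TABLE model_results_tmp (
        race_id  INT NOT NULL,
        horse_id INT NOT NULL,
        factor_a DECIMAL(10,4),
        PRIMARY KEY (race_id, horse_id)
    ) ENGINE=MEMORY;

    -- ...lots of cheap inserts while the model runs...
    INSERT INTO model_results_tmp (race_id, horse_id, factor_a) VALUES (1234, 56, 0.8123);

    -- One bulk update of the main table at the end of the run.
    UPDATE model_output o
    JOIN model_results_tmp t ON t.race_id = o.race_id AND t.horse_id = o.horse_id
    SET o.factor_a = t.factor_a;

    DROP TEMPORARY TABLE model_results_tmp;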

  10. #10
    MonkeyF0cker
    Join Date: 06-12-07
    Posts: 12,144
    Betpoints: 1127

    Quote Originally Posted by TravisVOX View Post
    I have a table for the ultimate output of the model. For the first version of the model, I have four columns with the key variables in the formulas.

    I'm trying to avoid looping through each and every line to update, and keep the calculation within MySQL. If someone could offer up a better strategy for storing model output, I'm all ears, as getting this right is obviously important.
    Uhh. Why wouldn't you be manipulating the data in PHP?

  11. #11
    TravisVOX
    Join Date: 12-25-12
    Posts: 30
    Betpoints: 861

    I've been trying to work this stuff through a bit more.

    For example, in one race there are 10 horses. Let's say I have 20 factors to calculate/determine for each horse. So I do that (calling upon dozens of recent races for each horse) and update the temporary table.

    Then, there are more factors that use the 20 factors I just created. So I calculate those. Then more are created depending on the ones just created.... and the cycle continues.

    I could, theoretically, just store each phase of the process to the database and reload. Obviously, that's crazy in terms of resource consumption. So, I'm trying to figure out the best way to optimize the speed of the process. Keep in mind, I'm trying to loop through 200,000 races or so and I haven't even made it to the step of handicapping the non-out-of-sample data.

  12. #12
    MonkeyF0cker
    Join Date: 06-12-07
    Posts: 12,144
    Betpoints: 1127

    No. You store them into variables in PHP (or whatever language you're using) and manipulate the data there.

    There is almost no reason that you should be creating temporary tables.
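
    Something along these lines (sketch only; the schema and formulas are made up):

    Code:
    <?php
    // Sketch only: made-up schema. Keep the intermediate factors in PHP arrays, not in tables.
    $pdo    = new PDO('mysql:host=localhost;dbname=handicapping', 'user', 'pass');
    $raceId = 1234; // example race

    $factors = array(); // keyed by horse_id, lives in memory for the whole calculation

    // Pull the raw data once...
    $stmt = $pdo->prepare("SELECT horse_id, speed_fig FROM results WHERE race_id = ?");
    $stmt->execute(array($raceId));
    foreach ($stmt as $row) {
        // ...first-level factor straight from the raw data (stand-in formula)...
        $factors[$row['horse_id']]['f1'] = $row['speed_fig'] * 0.5;
    }

    // ...then derive the later factors from the earlier ones, all in PHP, no table writes.
    foreach ($factors as $horseId => $f) {
        $factors[$horseId]['f2'] = $f['f1'] + 10; // stand-in formula
    }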

    I would suggest reading a book or two on programming before you attempt to go any further.

  13. #13
    TravisVOX
    Join Date: 12-25-12
    Posts: 30
    Betpoints: 861

    Quote Originally Posted by MonkeyF0cker View Post
    No. You store them into variables in PHP (or whatever language you're using) and manipulate the data there.

    There is almost no reason that you should be creating temporary tables.

    I would suggest reading a book or two on programming before you attempt to go any further.
    No need to read a book or two. I understand. The whole thrust of this was to minimize the number of hard disk reads/writes to the database, which by themselves aren't bad, but which become cumbersome when you carry them out hundreds of thousands of times. And then with PHP, you reach a memory limit rather quickly, even if you increase the default setting.

  14. #14
    Richards
    Join Date: 10-20-10
    Posts: 386
    Betpoints: 198

    Quote Originally Posted by TravisVOX View Post
    No need to read a book or two. I understand. The whole thrust of this was to minimize the number of hard disk reads/writes to the database, which by themselves aren't bad, but which become cumbersome when you carry them out hundreds of thousands of times. And then with PHP, you reach a memory limit rather quickly, even if you increase the default setting.

    Obviously I could be missing some implementation details, but is there a reason the data needs to be recalculated every time for the historical data? I mean, are you summing factors in columns and including new race data, or are you recalculating the historical data each time? If so, why not store at least the partial calculations in a table or view, instead of throwing them out every time?

    I realize you'd be trading a CPU calculation for a disk read from the view/table, but you're probably having to do a lot of I/O anyway setting up temp tables every time, doing calculations, storing the results in memory or on disk, etc.

    P.S. If it were me I'd start with profiling the queries and the PHP script before spending time optimizing. Many times it's a programming error or a wrong assumption causing one query or function to take 90% of the CPU time.
    Last edited by Richards; 04-23-13 at 10:51 AM.

  15. #15
    MonkeyF0cker
    Join Date: 06-12-07
    Posts: 12,144
    Betpoints: 1127

    Quote Originally Posted by TravisVOX View Post
    No need to read a book or two. I understand. The whole thrust of this was to minimize the number of hard disk reads/writes to the database, which by themselves aren't bad, but which become cumbersome when you carry them out hundreds of thousands of times. And then with PHP, you reach a memory limit rather quickly, even if you increase the default setting.
    LOL. Minimize the number of HDD read/writes? YOU'RE MAXIMIZING THEM. Do you not understand that temporary tables are paged on the hard drive?

    You most definitely need to read a book or two on programming and SQL, but never mind me, I'm only a programmer.

    You can dispose of variables (IN ANY LANGUAGE!!!!!!) if you're running into memory issues. There's no possible way that you need to load the entire table/database/whatever at once. PHP is perfectly capable of doing whatever you're trying to do as are a vast number of other languages. It would most certainly help for you to understand the language you're working with first though.
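
    For example, something along these lines (sketch only; the table name and per-race routine are made up), instead of loading everything at once:

    Code:
    <?php
    // Sketch only: work through the races in batches and free each batch when done.
    $pdo = new PDO('mysql:host=localhost;dbname=handicapping', 'user', 'pass');

    $batchSize = 500;
    for ($offset = 0; ; $offset += $batchSize) {
        $sql   = sprintf("SELECT race_id FROM races ORDER BY race_id LIMIT %d OFFSET %d",
                         $batchSize, $offset);
        $races = $pdo->query($sql)->fetchAll(PDO::FETCH_COLUMN);
        if (!$races) {
            break; // no more races to process
        }
        foreach ($races as $raceId) {
            // handicapRace($pdo, $raceId); // hypothetical per-race routine
        }
        unset($races); // release the batch before loading the next one
    }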

  16. #16
    Blax0r
    Join Date: 10-13-10
    Posts: 688
    Betpoints: 1512

    Quote Originally Posted by TravisVOX View Post
    I've been trying to work this stuff through a bit more.

    For example, in one race there are 10 horses. Let's say I have 20 factors to calculate/determine for each horse. So I do that (calling upon dozens of recent races for each horse) and update the temporary table.

    Then, there are more factors that use the 20 factors I just created. So I calculate those. Then more are created depending on the ones just created.... and the cycle continues.

    I could, theoretically, just store each phase of the process to the database and reload. Obviously, that's crazy in terms of resource consumption. So, I'm trying to figure out the best way to optimize the speed of the process. Keep in mind, I'm trying to loop through 200,000 races or so and I haven't even made it to the step of handicapping the non-out-of-sample data.

    At first, I was going to dismiss this since a highly reputable member is tearing you apart, but I think I see what you're trying to convey. You have proprietary, calculated statistics that must be computed serially (I have a good guess on what that stat is lol) because they depend on the previous race's calculated stats.

    Here's my suggestion: Compute those factors over ALL races (ie, take a big time hit at first), and save them to the database (join on race id, horse id, etc). As new races are brought up, you can easily pull in the necessary data to execute your model (in PHP) since the analytics are already stored on your db.
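
    Roughly like this (sketch only; the table and column names are made up):

    Code:
    -- One row per (race, horse) holding the expensive precomputed factors.
    CREATE TABLE horse_factors (
        race_id  INT NOT NULL,
        horse_id INT NOT NULL,
        factor_1 DECIMAL(10,4),
        factor_2 DECIMAL(10,4),
        PRIMARY KEY (race_id, horse_id)
    );

    -- When a new race card comes up, the model just does a cheap keyed lookup.
    SELECT race_id, horse_id, factor_1, factor_2
    FROM horse_factors
    WHERE horse_id IN (101, 102, 103); -- example ids for today's entrants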

  17. #17
    MonkeyF0cker
    Join Date: 06-12-07
    Posts: 12,144
    Betpoints: 1127

    Well, that's the point. You don't need a table to store those variables unless...

    1. You don't know what you're doing.
    2. Your database is horrendously designed. See point #1.

    You should be perfectly able to manipulate the data through queries and PHP variables.

  18. #18
    Blax0r
    Join Date: 10-13-10
    Posts: 688
    Betpoints: 1512

    Quote Originally Posted by MonkeyF0cker View Post
    Well, that's the point. You don't need a table to store those variables unless...

    1. You don't know what you're doing.
    2. Your database is horrendously designed. See point #1.

    You should be perfectly able to manipulate the data through queries and PHP variables.
    I think the statistics he's trying to compute have a huge time-cost to generate per race; i.e., he has to loop through each horse's/rider's/whatever's entire history, do some calculation, and use that output as an input for the following races. One example would be some crazy decision tree that is "path-dependent", meaning he needs some representation of the route taken to reach the current node. You can definitely pull in the necessary data and do the entire computation in PHP, but at 200,000 historical samples (according to Travis), it'll be slow.

    He was trying to minimize this cost through various channels, but it may ultimately be best to just store those stats (but do the actual model output for "live" races in php). If you have a better solution, I would be interested to hear it as well, since I have a similar problem.
    Last edited by Blax0r; 04-24-13 at 11:09 AM.

  19. #19
    MonkeyF0cker
    Join Date: 06-12-07
    Posts: 12,144
    Betpoints: 1127

    Well, there are a number of ways it can be broken down... Class, distance, surface, track, horse, etc. The histories of each horse are relatively small compared to something like an MLB pitch by pitch database. Rarely is there a horse with a race history of over 200 races. Pull the independent variables from the database and manipulate the dependent variables in PHP. If those dependent variables are intermediary and used for calculating other dependent variables, then you'll have to code some sort of updating mechanism and store them (as you said). He's updating a temporary table after running the history for each horse though. If he's in the process of calculating a dependent or intermediary variable, there's no reason to update a table at that point. It's a huge waste of resources. In any case, it needs to be broken into steps. Some calculations can take hours especially some things that are difficult to quantify (like the class of a horse). It sounds to me like he's trying to do everything at once.

  20. #20
    Blax0r
    Join Date: 10-13-10
    Posts: 688
    Betpoints: 1512

    Quote Originally Posted by MonkeyF0cker View Post
    Well, there are a number of ways it can be broken down... Class, distance, surface, track, horse, etc. The histories of each horse are relatively small compared to something like an MLB pitch by pitch database. Rarely is there a horse with a race history of over 200 races. Pull the independent variables from the database and manipulate the dependent variables in PHP. If those dependent variables are intermediary and used for calculating other dependent variables, then you'll have to code some sort of updating mechanism and store them (as you said). He's updating a temporary table after running the history for each horse though. If he's in the process of calculating a dependent or intermediary variable, there's no reason to update a table at that point. It's a huge waste of resources. In any case, it needs to be broken into steps. Some calculations can take hours especially some things that are difficult to quantify (like the class of a horse). It sounds to me like he's trying to do everything at once.
    I definitely agree on the temp tables; there's not really a good reason to use them in a production-level code base. Unfortunately, it sounds like I can't improve my own process (I more or less do what you're saying for tennis), but this is good stuff; I do hope the original poster is reading this.

    Thanks for the insight.

  21. #21
    TravisVOX
    Join Date: 12-25-12
    Posts: 30
    Betpoints: 861

    Quote Originally Posted by Blax0r View Post
    I think the statistics he's trying to compute have a huge time-cost to generate per race; i.e., he has to loop through each horse's/rider's/whatever's entire history, do some calculation, and use that output as an input for the following races. One example would be some crazy decision tree that is "path-dependent", meaning he needs some representation of the route taken to reach the current node. You can definitely pull in the necessary data and do the entire computation in PHP, but at 200,000 historical samples (according to Travis), it'll be slow.

    He was trying to minimize this cost through various channels, but it may ultimately be best to just store those stats (but do the actual model output for "live" races in php). If you have a better solution, I would be interested to hear it as well, since I have a similar problem.
    This summarizes it pretty well. I don't worry about being torn apart on the internet, where everyone is tough. My database is fairly normalized. I have a few repeat points of data to avoid unnecessary joins as the approach is more analytical. For this year alone there are 40,000 horses with 90,000 starts and 538,000 past races amongst them. It's just a lot of data. Others who have done this use up to 120 variables per horse, per race!

    Quote Originally Posted by MonkeyF0cker View Post
    Pull the independent variables from the database and manipulate the dependent variables in PHP. If those dependent variables are intermediary and used for calculating other dependent variables, then you'll have to code some sort of updating mechanism and store them (as you said). He's updating a temporary table after running the history for each horse though. If he's in the process of calculating a dependent or intermediary variable, there's no reason to update a table at that point. It's a huge waste of resources. In any case, it needs to be broken into steps. Some calculations can take hours especially some things that are difficult to quantify (like the class of a horse). It sounds to me like he's trying to do everything at once.
    I never sent an update to the database after every horse in PHP. At first I did that while trying to do the calculations via queries. Then I brought those into PHP and batched the process to do 100 races (~800 horses) at once, increasing that number to maximize memory. It just always felt like it was taking too long. I'm fine with dumping temp tables, as I've never used them, but I had read that being stored in memory (on the database server) allowed them to move fast.

    At any rate, I follow the feedback/thoughts here and will start to implement. Thanks all!

  22. #22
    MonkeyF0cker
    Join Date: 06-12-07
    Posts: 12,144
    Betpoints: 1127

    Some of my class code for horses takes up to 13 hours to run locally with MS-SQL and C# from scratch. That's only for quantifying class. If you organize your data so that you give yourself the ability to incrementally update it with additional data, you won't need to continuously go through that process. You just need to update the affected variables. Rather than running a process that takes that long each night, it only takes about 10 minutes for me this way.
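
    The general idea looks something like this (sketch only, written in MySQL rather than MS-SQL since that's what the thread is using; the names and the running stat are made up):

    Code:
    -- Keep running aggregates per horse and fold in only the races not yet processed.
    CREATE TABLE horse_running_stats (
        horse_id       INT PRIMARY KEY,
        races_counted  INT NOT NULL DEFAULT 0,
        sum_speed_fig  DECIMAL(12,4) NOT NULL DEFAULT 0,
        last_race_date DATE
    );

    -- Nightly: fold in only results newer than the last processed date.
    UPDATE horse_running_stats s
    JOIN (SELECT horse_id, COUNT(*) AS n, SUM(speed_fig) AS fig_sum, MAX(race_date) AS latest
          FROM results
          WHERE race_date > '2013-04-22' -- last processed date
          GROUP BY horse_id) r ON r.horse_id = s.horse_id
    SET s.races_counted  = s.races_counted + r.n,
        s.sum_speed_fig  = s.sum_speed_fig + r.fig_sum,
        s.last_race_date = r.latest;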

    Since it sounds like you're just starting, the difficulty at the beginning is processing all of the data and finding the most predictive variables, methods, etc. It's certainly time consuming, but once you've found the best ways to approach certain tasks, you can create a shortcut of sorts like I mentioned above. Unfortunately, until then, there really aren't any shortcuts.

  23. #23
    jfreeman
    Join Date: 05-10-10
    Posts: 21
    Betpoints: 114

    You might want to look into Postgres. I saw huge improvements in simulation times when I moved from a MySQL-based database and carefully indexed my tables (pay attention to clustered vs. non-clustered indexes on each table). Lots of optimization at the lower levels that's taken care of for you.
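
    For what it's worth, the clustering piece looks something like this in Postgres (sketch only; the table and column names are made up). Postgres doesn't maintain a clustered index automatically; CLUSTER physically reorders the table along the chosen index once, so it's worth re-running after big loads:

    Code:
    CREATE INDEX idx_results_horse_date ON results (horse_id, race_date);
    CLUSTER results USING idx_results_horse_date;
    ANALYZE results;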

  24. #24
    chachi
    Lazy Git
    Join Date: 02-16-07
    Posts: 4,571
    Betpoints: 18

    Quote Originally Posted by jfreeman View Post
    You might want to look into Postgres. I saw huge improvements in simulation times when I moved from a MySQL-based database and carefully indexed my tables (pay attention to clustered vs. non-clustered indexes on each table). Lots of optimization at the lower levels that's taken care of for you.
    THIS ... I was surprised by the comparative query timings when I transferred some of my framework from MySQL to PGSQL.

  25. #25
    TravisVOX
    Join Date: 12-25-12
    Posts: 30
    Betpoints: 861

    Just wanted to provide an update to this project. The main takeaway I had was, no surprise, to bring the heavy calculating and processing into PHP rather than trying to avoid that with complex queries.

    I was able to improve performance a bit, but ultimately made one key move that made a huge difference. I basically turned my back on a pure relational table setup and focused on an analytical setup. I have a lot of repeat data in certain areas, but it's lightning quick. The database uses a lot of hard drive space, but it can handicap, right now, about 58 races per second. That is, I'd say, 58x faster than it was a few weeks ago.
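
    To give a sense of the shape (the names here are made up, but the structure is the idea): one wide, flat row per horse per race, with the inputs copied in so the handicapping pass never has to join back through the relational tables.

    Code:
    CREATE TABLE analysis_rows (
        race_id       INT NOT NULL,
        horse_id      INT NOT NULL,
        track         VARCHAR(10),    -- repeated from the race table
        distance      DECIMAL(5,2),   -- repeated from the race table
        avg_speed_fig DECIMAL(10,4),  -- precomputed factor
        class_rating  DECIMAL(10,4),  -- precomputed factor
        PRIMARY KEY (race_id, horse_id)
    );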

    So thanks all for the input... we're off and running now!
