Faster or not?

by **whiskyjohn** » Mon May 18, 2015 5:46 am

BillGatesIII wrote:Without any sources to backup my ideas, I'm pretty sure PostgreSQL optimizes queries so it doesn't matter if you write 'A and B and C' or 'C and A and B'. If for example A is used many times in different columns, the optimizer will first check the value of A before checking B and C (and maybe it's doing it parallel, I don't know) and store the result. So in my opinion, the best way to optimize query times is to be consistent with the database fields you use.

And like kraada said, I don't think you will notice any difference. PostgreSQL configuration, hardware and database design and size have a bigger influence on the speed of inserting, updating and retreiving records.

Hmm.

Since I know next to nothing about Postgresql internals, I need to try to comprehend what you are suggesting. To do that,
I need to be more specific. Let's say I have players P1..Pn, Stats S1..Sn(where Sn are column summations), and conditions
C1..Cn(where Cn are column fields). And let's chose a specific stat S1 and specific conditions Ci,Cj,Ck in the form below:

sum(if[
condition C0..Ci-1
condition Ci enum_f_cbet_action = 'C' (51538)
condition Cj enum_t_cbet_action = 'C' (22803)
condition Ck enum_r_cbet_action = 'C' (9578)
condition Ck+1..Cn
, 1, 0])

Now let's consider a specific Player, P1 who has played a million hands. The numbers in ()'s above represent how many times
P1 has acted with the condition. In my initial post, I noted that if I simply reverse the order of summation, I can reduce
the number of comparisions because of the relative frequency of Flop/Turn/River actions.

If we were running a report on a specific player P1, then this reordering would perhaps be faster. But now we need to access
all players in the database. Here is where the questions begin.

Across all players, the order Ci,Cj,Ck in S1 is the same, but of course their values are all different because the players
are all different. Does Postgresql/PT process stats S1..Sn one player at a time before proceeding to the next player?

Likewise, for the same player, the order of Ci,Cj,Ck may be the same in all stats Sn where they are used, but again the values
are all different because the preceding conditions C0..Ci-1 may be different. But perhaps this is not the right way to think
about a summation. That is, proceeding from top to bottom, getting smaller in each successive check, until finally arriving
at the result with the last term in the summation. This is how I traditionally think of it, and why reordering is efficient.

Instead, I suppose all conditions could be counted as true or false for all hands in the database, and then the "intersection"
of various combinations of conditions can produce the final result of summation. If that is true, perhaps this is what you
mean by "If for example A is used many times in different columns, the optimizer will first check the value of A before checking
B and C (and maybe it's doing it parallel, I don't know) and store the result."

Can you comment a bit more please? It is certainly possible that Postgresql is looking at how many times Ci, Cj, and Ck are
referenced across all stats Sn in order to "store the result". But without knowing a priori how one datatype compares to
another datatype, it would seem to require processing each datatype (player in this case) at a time, and also would require
that the summation operation proceed using the intersection-like method.

Anyway, if this is all true and done efficiently, then yes, reordering may have little effect. I may still save it for cases
where one condition is far more rare than everything else, and order it in a way that does not seriously disrupt the
readability of the stat itself.

Thanks for any comments/clarifications.
John

by **kraada** » Mon May 18, 2015 8:27 am

PostgreSQL creates a query plan which is its way of figuring out how to get the query done the fastest. Understanding exactly how it does what it does is pretty complex, but you can check to see what the plans are using EXPLAIN. Here is a 45 minute long video discussing the topic if you're interested.

by **BillGatesIII** » Mon May 18, 2015 11:31 am

With all respect, whatever performance improvement you might achieve (if any), I think it's way too little to spend much energy on.

But, because you asked, here are some answers

whiskyjohn wrote:...In my initial post, I noted that if I simply reverse the order of summation, I can reduce
the number of comparisions because of the relative frequency of Flop/Turn/River actions.

Reversing the order does not mean it will be executed in that order by PostgreSQL. Like I said, it might be a parallel execution or in whatever order the optimizer thinks will be best. Just for the fun, I did rewrite a stat so it looked like this.

Code: Select all: sum(case when enum_r_cbet_action = 'C' then case when enum_t_cbet_action = 'C' then case when enum_f_cbet_action = 'C' then 1 else 0 end else 0 end else 0 end)

I'm not even sure if it did force PostgreSQL to execute from top to bottom because the execution plan and times where the same as for the original stat :shock:

So maybe it always evaluates everything, which I doubt because the documentation says otherwise, I don't know.

whiskyjohn wrote:...Across all players, the order Ci,Cj,Ck in S1 is the same, but of course their values are all different because the players
are all different. Does Postgresql/PT process stats S1..Sn one player at a time before proceeding to the next player?

The short answer is no. The long answer depends on what you're looking at. Retrieving non-cached stats? Updating the cache with the last hand played? Rebuilding the cache? Running a report? Retrieving a cached stat?

whiskyjohn wrote:...Likewise, for the same player, the order of Ci,Cj,Ck may be the same in all stats Sn where they are used, but again the values
are all different because the preceding conditions C0..Ci-1 may be different. But perhaps this is not the right way to think
about a summation. That is, proceeding from top to bottom, getting smaller in each successive check, until finally arriving
at the result with the last term in the summation. This is how I traditionally think of it, and why reordering is efficient.

Instead, I suppose all conditions could be counted as true or false for all hands in the database, and then the "intersection"
of various combinations of conditions can produce the final result of summation. If that is true, perhaps this is what you
mean by "If for example A is used many times in different columns, the optimizer will first check the value of A before checking
B and C (and maybe it's doing it parallel, I don't know) and store the result."

Can you comment a bit more please? It is certainly possible that Postgresql is looking at how many times Ci, Cj, and Ck are
referenced across all stats Sn in order to "store the result". But without knowing a priori how one datatype compares to
another datatype, it would seem to require processing each datatype (player in this case) at a time, and also would require
that the summation operation proceed using the intersection-like method.

This part of your post I do not fully understand. Maybe an example will give you some answers. When PT4 updates the cache or retrieves non cached stats, part of the query will look like this.

Code: Select all: sum(case when condition_a and condition_b and condition_c then 1 else 0 end) as "column_1", sum(case when condition_a and condition_b and condition_d then 1 else 0 end) as "column_2", sum(case when condition_d and condition_e then 1 else 0 end) as "column_3", sum(case when condition_a and not condition_b then 1 else 0 end) as "column_4", sum(case when not condition_a and condition_d and not condition_f then 1 else 0 end) as "column_4"

What I mean by storing (or remembering) a value is that for example if for a specific record the value of condition_a is false, for all of the columns using condition_a the remaining parts within the sum() for those columns don't have to be evaluated. And again, I don't know if PostgreSQL is that smart but it wouldn't suprise me if indeed it is that smart.

whiskyjohn wrote:Anyway, if this is all true and done efficiently, then yes, reordering may have little effect. I may still save it for cases
where one condition is far more rare than everything else, and order it in a way that does not seriously disrupt the
readability of the stat itself.

Thanks for any comments/clarifications.
John

You're welcome, always a pleasure having a conversation with you

by **whiskyjohn** » Mon May 18, 2015 4:25 pm

kraada wrote:PostgreSQL creates a query plan which is its way of figuring out how to get the query done the fastest. Understanding exactly how it does what it does is pretty complex, but you can check to see what the plans are using EXPLAIN. Here is a 45 minute long video discussing the topic if you're interested.

Interesting video, but I am not sure it covered how the query optimizer deals with specific fields inside a table.
From my understanding, it was more about how a Table might be indexed and searched, or joined or merged
with another table. I am still very fuzzy on how Postgresql handles the order of conditions within a summation.
These seem like different concepts to me. But at least I now have some appreciation for planned execution
of database instructions that is going on behind the scenes.

John

by **whiskyjohn** » Mon May 18, 2015 5:56 pm

BillGatesIII wrote:With all respect, whatever performance improvement you might achieve (if any), I think it's way too little to spend much energy on.

But, because you asked, here are some answers

whiskyjohn wrote:...In my initial post, I noted that if I simply reverse the order of summation, I can reduce
the number of comparisions because of the relative frequency of Flop/Turn/River actions.

Reversing the order does not mean it will be executed in that order by PostgreSQL. Like I said, it might be a parallel execution or in whatever order the optimizer thinks will be best. Just for the fun, I did rewrite a stat so it looked like this.
Code: Select all
sum(case when enum_r_cbet_action = 'C' then case when enum_t_cbet_action = 'C' then case when enum_f_cbet_action = 'C' then 1 else 0 end else 0 end else 0 end)

I'm not even sure if it did force PostgreSQL to execute from top to bottom because the execution plan and times where the same as for the original stat So maybe it always evaluates everything, which I doubt because the documentation says otherwise, I don't know.

Regardless of the execution plan, I would not expect to see much change in execution time unless millions of records
are processed. Assuming for a moment that Postgres did execute in the summation order written, the idea is that the
effect would be cumulative. As I wrote in response to kraada, one of the issues I have is that I cannot adequately test
if any of these changes produce a measurable effect because rebuilding the cache is already a long, tiresome process.
I wouldn't mind if a few simple reorderings saved an hour on a 24-hour rebuild but I am not going to run a lot of tests
just to see if anything changed.

BillGatesIII wrote:
whiskyjohn wrote:...Across all players, the order Ci,Cj,Ck in S1 is the same, but of course their values are all different because the players
are all different. Does Postgresql/PT process stats S1..Sn one player at a time before proceeding to the next player?

The short answer is no. The long answer depends on what you're looking at. Retrieving non-cached stats? Updating the cache with the last hand played? Rebuilding the cache? Running a report? Retrieving a cached stat?

For now, I am content just to discuss rebuilding the cache and retreiving a cached stat for the HUD. Is there reason to think that
doing the same for a non-cached stat would execute differently? I have maxed out the number of cached stats so I do use a mix
of both.

BillGatesIII wrote:
whiskyjohn wrote:...Likewise, for the same player, the order of Ci,Cj,Ck may be the same in all stats Sn where they are used, but again the values
are all different because the preceding conditions C0..Ci-1 may be different. But perhaps this is not the right way to think
about a summation. That is, proceeding from top to bottom, getting smaller in each successive check, until finally arriving
at the result with the last term in the summation. This is how I traditionally think of it, and why reordering is efficient.

Instead, I suppose all conditions could be counted as true or false for all hands in the database, and then the "intersection"
of various combinations of conditions can produce the final result of summation. If that is true, perhaps this is what you
mean by "If for example A is used many times in different columns, the optimizer will first check the value of A before checking
B and C (and maybe it's doing it parallel, I don't know) and store the result."

Can you comment a bit more please? It is certainly possible that Postgresql is looking at how many times Ci, Cj, and Ck are
referenced across all stats Sn in order to "store the result". But without knowing a priori how one datatype compares to
another datatype, it would seem to require processing each datatype (player in this case) at a time, and also would require
that the summation operation proceed using the intersection-like method.

This part of your post I do not fully understand.

Well, I would like to find a more clear way to describe what is going on, but I would need to know more about Postgresql to do it.

You are basically saying that don't count on conditions to be executed in the order that I write them. I watched the video that
kraada referenced and there the concepts were more about how Postgresql decides the best way to search though database
records, depending on if a table is indexed or not, it's size, and other pieces of information. And the other part of the video
was about how tables could be merged.

These are all important concepts to managing database queries but it does not seem to address what I am after with summation
order. If I know a priori that Ck is true an order of magnitude less than Cj, and Cj is true an order of magnitude less than Ci, then
the best way to execute these conditions is Ck->Cj->Ci and not Ci->Cj->Ck. However, this assumes some kind of sequential search
of the records that reference the specific fields needed for comparisions in Ci, Cj, and Ck. This sequential search also requires
that the number of times Ci, Cj, and Ck are true depends on preceding conditions C0...Ci-1.

What I was trying to describe above is that one could view the entire summation like a kind of Venn Diagram and rather than
starting with condition C0, followed by C1, followed by C2... where evaluation of successive conditions produces fewer and
fewer matching records; instead the "intersection" of all conditions which are true produces the sum. Perhaps it is easier to
think of what I mean if I said color all C0s blue, all C1s green, all C2s yellow, etc. The sum is now all the records of the
database which are "painted" blue & green & yellow(like a bitmask if you know that construction). And furthermore, if I know
C0 is a frequently used condition, I can remember to leave all the C0 records painted blue when I go on to sum fields of another stat.

Now all this would be taking place with some temporary index or construction that requires more detail. That would be the
optimizers job. But I am not yet convinced such a plan is being constructed behind the scenes.

BillGatesIII wrote:Maybe an example will give you some answers. When PT4 updates the cache or retrieves non cached stats, part of the query will look like this.

Code: Select all
sum(case when condition_a and condition_b and condition_c then 1 else 0 end) as "column_1", sum(case when condition_a and condition_b and condition_d then 1 else 0 end) as "column_2", sum(case when condition_d and condition_e then 1 else 0 end) as "column_3", sum(case when condition_a and not condition_b then 1 else 0 end) as "column_4", sum(case when not condition_a and condition_d and not condition_f then 1 else 0 end) as "column_4"

What I mean by storing (or remembering) a value is that for example if for a specific record the value of condition_a is false, for all of the columns using condition_a the remaining parts within the sum() for those columns don't have to be evaluated. And again, I don't know if PostgreSQL is that smart but it wouldn't suprise me if indeed it is that smart.

Well now here I don't fully understand what you say, but perhaps we are saying the same thing. Remembering/storing that "condition_a is false",
to quote your text above, is the same thing I mean when I say color C0 blue so that if C0 is referenced again, all records for which it is true(or false)
are already colored blue.

I'm not convinced Postgresql is so "smart", however, mainly because the optimizations I learned about were not being done at this level.

BTW, is a full vaccuum/analyze helpful in planning a better optimization path? Is there something more that can be done?

BillGatesIII wrote:
whiskyjohn wrote:Thanks for any comments/clarifications.
John

You're welcome, always a pleasure having a conversation with you

Thanks. I don't really look on any of this as a waste of time. I am actually far more interested in the programming than the actual
poker playing. So learning more details about Postgresql, and how to design stats, and better mine the database is all of use to me.

John

by **BillGatesIII** » Tue May 19, 2015 3:41 am

Ok. Let's wrap this up. From bottom to top.

- If you want to know more about PostgreSQL internals, you will like this article, especially with your C background.
http://patshaughnessy.net/2014/10/13/following-a-select-statement-through-postgres-internals

- Yes, we do mean the same thing with remembering values and referencing the same color. What we have to find out is how 'smart' PostgreSQL is, I'll might run some tests later and let you know the results.

- Rebuilding the custom cache takes an immense amount of time if you have a lot of hands in your database and a lot of custom stats. I have written my own 'custom cache' update SQL queries which will only update the database fields that needs to be updated so it takes minutes instead of hours. The script is too rudimentary and complex to share at the moment. I think this is the way PT4 should handle changes in cached columns, maybe they will implement something like this in the future.

by **whiskyjohn** » Thu May 28, 2015 9:37 am

BillGatesIII wrote:Ok. Let's wrap this up. From bottom to top.

- If you want to know more about PostgreSQL internals, you will like this article, especially with your C background.
http://patshaughnessy.net/2014/10/13/following-a-select-statement-through-postgres-internals

Sorry for the delay in responding. That link was interesting, but pretty much confirms all the nightmarish things Postgresql does behind
the scenes. In my earlier C programming days, I implented much simpler datastructures to index and sort dynamic sets of data. Later,
I implemented some more advanced datasets using Perl and rudimentary database ties to read/write the data on disk. My boss once
suggested we use an ODBC compliant database like Postgresql because it was a selling point. I rejected that idea outright because all
we needed to do from a database perspective was accomplished in a few hundred lines of Perl code. Frankly, Pokertracker is not all
that more complex.

Had I the time, I could have written something to turn hand histories into reports much much faster while using far less space on disk.
However, writing the user interface, the graphics code, and the HUD all require a great deal of time and finesse.

BillGatesIII wrote:- Yes, we do mean the same thing with remembering values and referencing the same color. What we have to find out is how 'smart' PostgreSQL is, I'll might run some tests later and let you know the results.

I reordered some of my custom stats and rebuilt a test database of a few million hands. I can't say I noticed any speed difference in importing
hands or rebuilding the cache. But my machine had load from other running programs as well. I didn't perform this test in a vacuum. I think in order
to get a straight answer to my questions abort sorting and ordering I would need to hang out on postgresql forums. Maybe one day.

BillGatesIII wrote:- Rebuilding the custom cache takes an immense amount of time if you have a lot of hands in your database and a lot of custom stats. I have written my own 'custom cache' update SQL queries which will only update the database fields that needs to be updated so it takes minutes instead of hours. The script is too rudimentary and complex to share at the moment. I think this is the way PT4 should handle changes in cached columns, maybe they will implement something like this in the future.

You would think with the ridiculously long time it takes to rebuild a cache that some improvements could be made. It does me little good to report bugs if,
whenever PT eventually gets around to releasing a fix, that I need to reimport all my hands and rebuild the cache. This could takes days and days!

For example, I submitted a bug about hands which could be considered donks. It was agreed this was a bug so some future release of PT should fix it.
Now, how should I go about updating my database? 99.9% of hands won't be affected by this bug. So one could argue not to worry about reimporting
those hands anyway.

But what other choice to I have? I am not even sure what happens if I try to reimport a hand which is already in the database. I know I get a "duplicate" count,
but are any of the fields updated to reflect a change (donk or not a donk)? If not, then I need to somehow identify all hands to purge (another nightmare),
purge them, reimport them, and rebuild the entire custom cache. Ridiculous!

John

by **kraada** » Thu May 28, 2015 2:29 pm

When a hand is detected as a duplicate no data in the database is changed. If you look at the logs, what we actually do behind the scenes is use COPY to attempt to insert the data, and it is rejected when there is a duplicate hand and the COPY fails.

by **whiskyjohn** » Thu May 28, 2015 3:00 pm

kraada wrote:When a hand is detected as a duplicate no data in the database is changed. If you look at the logs, what we actually do behind the scenes is use COPY to attempt to insert the data, and it is rejected when there is a duplicate hand and the COPY fails.

From the way you are describing it, a change in a database field is not considered a change in a database record, and therefore no change is
made to the database. Am I wrong on this?

So if there is a bug fix in a PT that requires hands to be reimported, how do you suggest identifying those hands and updating the database?
In general, I won't know exactly what hands will be affected by a bug fix so identifying and purging does not work. And reimporting also does
not work because the hand is considered a duplicate even though some database field(s) would change. So the only other choice is rebuild the
database from scratch, or simply ignore the problem and let new hands be imported with different field settings than prior ones. Neither of
those alternatives is pleasant!

John

by **kraada** » Thu May 28, 2015 3:30 pm

There isn't a method to purge a single hand - the smallest purgeable item is a session. You could identify sessions containing those hands and then purge those sessions via Database Management -> Purge Hands and reimport the hands at that point. It really depends how common it is whether that's feasible or not.

Faster or not?

Re: Faster or not?

Re: Faster or not?

Re: Faster or not?

Re: Faster or not?

Re: Faster or not?

Re: Faster or not?

Re: Faster or not?

Re: Faster or not?

Re: Faster or not?

Re: Faster or not?

Who is online

14 day free trial
download now

PT4 Supported Sites

PT4 Videos

Press Center

follow us on