FastBernoulliTrial.h 17 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379
  1. /* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
  2. /* vim: set ts=8 sts=2 et sw=2 tw=80: */
  3. /* This Source Code Form is subject to the terms of the Mozilla Public
  4. * License, v. 2.0. If a copy of the MPL was not distributed with this
  5. * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
  6. #ifndef mozilla_FastBernoulliTrial_h
  7. #define mozilla_FastBernoulliTrial_h
  8. #include "mozilla/Assertions.h"
  9. #include "mozilla/XorShift128PlusRNG.h"
  10. #include <cmath>
  11. #include <stdint.h>
  12. namespace mozilla {
  13. /**
  14. * class FastBernoulliTrial: Efficient sampling with uniform probability
  15. *
  16. * When gathering statistics about a program's behavior, we may be observing
  17. * events that occur very frequently (e.g., function calls or memory
  18. * allocations) and we may be gathering information that is somewhat expensive
  19. * to produce (e.g., call stacks). Sampling all the events could have a
  20. * significant impact on the program's performance.
  21. *
  22. * Why not just sample every N'th event? This technique is called "systematic
  23. * sampling"; it's simple and efficient, and it's fine if we imagine a
  24. * patternless stream of events. But what if we're sampling allocations, and the
  25. * program happens to have a loop where each iteration does exactly N
  26. * allocations? You would end up sampling the same allocation every time through
  27. * the loop; the entire rest of the loop becomes invisible to your measurements!
  28. * More generally, if each iteration does M allocations, and M and N have any
  29. * common divisor at all, most allocation sites will never be sampled. If
  30. * they're both even, say, the odd-numbered allocations disappear from your
  31. * results.
  32. *
  33. * Ideally, we'd like each event to have some probability P of being sampled,
  34. * independent of its neighbors and of its position in the sequence. This is
  35. * called "Bernoulli sampling", and it doesn't suffer from any of the problems
  36. * mentioned above.
  37. *
  38. * One disadvantage of Bernoulli sampling is that you can't be sure exactly how
  39. * many samples you'll get: technically, it's possible that you might sample
  40. * none of them, or all of them. But if the number of events N is large, these
  41. * aren't likely outcomes; you can generally expect somewhere around P * N
  42. * events to be sampled.
  43. *
  44. * The other disadvantage of Bernoulli sampling is that you have to generate a
  45. * random number for every event, which can be slow.
  46. *
  47. * [significant pause]
  48. *
  49. * BUT NOT WITH THIS CLASS! FastBernoulliTrial lets you do true Bernoulli
  50. * sampling, while generating a fresh random number only when we do decide to
  51. * sample an event, not on every trial. When it decides not to sample, a call to
  52. * |FastBernoulliTrial::trial| is nothing but decrementing a counter and
  53. * comparing it to zero. So the lower your sampling probability is, the less
  54. * overhead FastBernoulliTrial imposes.
  55. *
  56. * Probabilities of 0 and 1 are handled efficiently. (In neither case need we
  57. * ever generate a random number at all.)
  58. *
  59. * The essential API:
  60. *
  61. * - FastBernoulliTrial(double P)
  62. * Construct an instance that selects events with probability P.
  63. *
  64. * - FastBernoulliTrial::trial()
  65. * Return true with probability P. Call this each time an event occurs, to
  66. * decide whether to sample it or not.
  67. *
  68. * - FastBernoulliTrial::trial(size_t n)
  69. * Equivalent to calling trial() |n| times, and returning true if any of those
  70. * calls do. However, like trial, this runs in fast constant time.
  71. *
  72. * What is this good for? In some applications, some events are "bigger" than
  73. * others. For example, large allocations are more significant than small
  74. * allocations. Perhaps we'd like to imagine that we're drawing allocations
  75. * from a stream of bytes, and performing a separate Bernoulli trial on every
  76. * byte from the stream. We can accomplish this by calling |t.trial(S)| for
  77. * the number of bytes S, and sampling the event if that returns true.
  78. *
  79. * Of course, this style of sampling needs to be paired with analysis and
  80. * presentation that makes the size of the event apparent, lest trials with
  81. * large values for |n| appear to be indistinguishable from those with small
  82. * values for |n|.
  83. */
  84. class FastBernoulliTrial {
  85. /*
  86. * This comment should just read, "Generate skip counts with a geometric
  87. * distribution", and leave everyone to go look that up and see why it's the
  88. * right thing to do, if they don't know already.
  89. *
  90. * BUT IF YOU'RE CURIOUS, COMMENTS ARE FREE...
  91. *
  92. * Instead of generating a fresh random number for every trial, we can
  93. * randomly generate a count of how many times we should return false before
  94. * the next time we return true. We call this a "skip count". Once we've
  95. * returned true, we generate a fresh skip count, and begin counting down
  96. * again.
  97. *
  98. * Here's an awesome fact: by exercising a little care in the way we generate
  99. * skip counts, we can produce results indistinguishable from those we would
  100. * get "rolling the dice" afresh for every trial.
  101. *
  102. * In short, skip counts in Bernoulli trials of probability P obey a geometric
  103. * distribution. If a random variable X is uniformly distributed from [0..1),
  104. * then std::floor(std::log(X) / std::log(1-P)) has the appropriate geometric
  105. * distribution for the skip counts.
  106. *
  107. * Why that formula?
  108. *
  109. * Suppose we're to return |true| with some probability P, say, 0.3. Spread
  110. * all possible futures along a line segment of length 1. In portion P of
  111. * those cases, we'll return true on the next call to |trial|; the skip count
  112. * is 0. For the remaining portion 1-P of cases, the skip count is 1 or more.
  113. *
  114. * skip: 0 1 or more
  115. * |------------------^-----------------------------------------|
  116. * portion: 0.3 0.7
  117. * P 1-P
  118. *
  119. * But the "1 or more" section of the line is subdivided the same way: *within
  120. * that section*, in portion P the second call to |trial()| returns true, and in
  121. * portion 1-P it returns false a second time; the skip count is two or more.
  122. * So we return true on the second call in proportion 0.7 * 0.3, and skip at
  123. * least the first two in proportion 0.7 * 0.7.
  124. *
  125. * skip: 0 1 2 or more
  126. * |------------------^------------^----------------------------|
  127. * portion: 0.3 0.7 * 0.3 0.7 * 0.7
  128. * P (1-P)*P (1-P)^2
  129. *
  130. * We can continue to subdivide:
  131. *
  132. * skip >= 0: |------------------------------------------------- (1-P)^0 --|
  133. * skip >= 1: | ------------------------------- (1-P)^1 --|
  134. * skip >= 2: | ------------------ (1-P)^2 --|
  135. * skip >= 3: | ^ ---------- (1-P)^3 --|
  136. * skip >= 4: | . --- (1-P)^4 --|
  137. * .
  138. * ^X, see below
  139. *
  140. * In other words, the likelihood of the next n calls to |trial| returning
  141. * false is (1-P)^n. The longer a run we require, the more the likelihood
  142. * drops. Further calls may return false too, but this is the probability
  143. * we'll skip at least n.
  144. *
  145. * This is interesting, because we can pick a point along this line segment
  146. * and see which skip count's range it falls within; the point X above, for
  147. * example, is within the ">= 2" range, but not within the ">= 3" range, so it
  148. * designates a skip count of 2. So if we pick points on the line at random
  149. * and use the skip counts they fall under, that will be indistinguishable
  150. * from generating a fresh random number between 0 and 1 for each trial and
  151. * comparing it to P.
  152. *
  153. * So to find the skip count for a point X, we must ask: To what whole power
  154. * must we raise 1-P such that we include X, but the next power would exclude
  155. * it? This is exactly std::floor(std::log(X) / std::log(1-P)).
  156. *
  157. * Our algorithm is then, simply: When constructed, compute an initial skip
  158. * count. Return false from |trial| that many times, and then compute a new skip
  159. * count.
  160. *
  161. * For a call to |trial(n)|, if the skip count is greater than n, return false
  162. * and subtract n from the skip count. If the skip count is less than n,
  163. * return true and compute a new skip count. Since each trial is independent,
  164. * it doesn't matter by how much n overshoots the skip count; we can actually
  165. * compute a new skip count at *any* time without affecting the distribution.
  166. * This is really beautiful.
  167. */
  168. public:
  169. /**
  170. * Construct a fast Bernoulli trial generator. Calls to |trial()| return true
  171. * with probability |aProbability|. Use |aState0| and |aState1| to seed the
  172. * random number generator; both may not be zero.
  173. */
  174. FastBernoulliTrial(double aProbability, uint64_t aState0, uint64_t aState1)
  175. : mProbability(0)
  176. , mInvLogNotProbability(0)
  177. , mGenerator(aState0, aState1)
  178. , mSkipCount(0)
  179. {
  180. setProbability(aProbability);
  181. }
  182. /**
  183. * Return true with probability |mProbability|. Call this each time an event
  184. * occurs, to decide whether to sample it or not. The lower |mProbability| is,
  185. * the faster this function runs.
  186. */
  187. bool trial() {
  188. if (mSkipCount) {
  189. mSkipCount--;
  190. return false;
  191. }
  192. return chooseSkipCount();
  193. }
  194. /**
  195. * Equivalent to calling trial() |n| times, and returning true if any of those
  196. * calls do. However, like trial, this runs in fast constant time.
  197. *
  198. * What is this good for? In some applications, some events are "bigger" than
  199. * others. For example, large allocations are more significant than small
  200. * allocations. Perhaps we'd like to imagine that we're drawing allocations
  201. * from a stream of bytes, and performing a separate Bernoulli trial on every
  202. * byte from the stream. We can accomplish this by calling |t.trial(S)| for
  203. * the number of bytes S, and sampling the event if that returns true.
  204. *
  205. * Of course, this style of sampling needs to be paired with analysis and
  206. * presentation that makes the "size" of the event apparent, lest trials with
  207. * large values for |n| appear to be indistinguishable from those with small
  208. * values for |n|, despite being potentially much more likely to be sampled.
  209. */
  210. bool trial(size_t aCount) {
  211. if (mSkipCount > aCount) {
  212. mSkipCount -= aCount;
  213. return false;
  214. }
  215. return chooseSkipCount();
  216. }
  217. void setRandomState(uint64_t aState0, uint64_t aState1) {
  218. mGenerator.setState(aState0, aState1);
  219. }
  220. void setProbability(double aProbability) {
  221. MOZ_ASSERT(0 <= aProbability && aProbability <= 1);
  222. mProbability = aProbability;
  223. if (0 < mProbability && mProbability < 1) {
  224. /*
  225. * Let's look carefully at how this calculation plays out in floating-
  226. * point arithmetic. We'll assume IEEE, but the final C++ code we arrive
  227. * at would still be fine if our numbers were mathematically perfect. So,
  228. * while we've considered IEEE's edge cases, we haven't done anything that
  229. * should be actively bad when using other representations.
  230. *
  231. * (In the below, read comparisons as exact mathematical comparisons: when
  232. * we say something "equals 1", that means it's exactly equal to 1. We
  233. * treat approximation using intervals with open boundaries: saying a
  234. * value is in (0,1) doesn't specify how close to 0 or 1 the value gets.
  235. * When we use closed boundaries like [2**-53, 1], we're careful to ensure
  236. * the boundary values are actually representable.)
  237. *
  238. * - After the comparison above, we know mProbability is in (0,1).
  239. *
  240. * - The gaps below 1 are 2**-53, so that interval is (0, 1-2**-53].
  241. *
  242. * - Because the floating-point gaps near 1 are wider than those near
  243. * zero, there are many small positive doubles ε such that 1-ε rounds to
  244. * exactly 1. However, 2**-53 can be represented exactly. So
  245. * 1-mProbability is in [2**-53, 1].
  246. *
  247. * - log(1 - mProbability) is thus in (-37, 0].
  248. *
  249. * That range includes zero, but when we use mInvLogNotProbability, it
  250. * would be helpful if we could trust that it's negative. So when log(1
  251. * - mProbability) is 0, we'll just set mProbability to 0, so that
  252. * mInvLogNotProbability is not used in chooseSkipCount.
  253. *
  254. * - How much of the range of mProbability does this cause us to ignore?
  255. * The only value for which log returns 0 is exactly 1; the slope of log
  256. * at 1 is 1, so for small ε such that 1 - ε != 1, log(1 - ε) is -ε,
  257. * never 0. The gaps near one are larger than the gaps near zero, so if
  258. * 1 - ε wasn't 1, then -ε is representable. So if log(1 - mProbability)
  259. * isn't 0, then 1 - mProbability isn't 1, which means that mProbability
  260. * is at least 2**-53, as discussed earlier. This is a sampling
  261. * likelihood of roughly one in ten trillion, which is unlikely to be
  262. * distinguishable from zero in practice.
  263. *
  264. * So by forbidding zero, we've tightened our range to (-37, -2**-53].
  265. *
  266. * - Finally, 1 / log(1 - mProbability) is in [-2**53, -1/37). This all
  267. * falls readily within the range of an IEEE double.
  268. *
  269. * ALL THAT HAVING BEEN SAID: here are the five lines of actual code:
  270. */
  271. double logNotProbability = std::log(1 - mProbability);
  272. if (logNotProbability == 0.0)
  273. mProbability = 0.0;
  274. else
  275. mInvLogNotProbability = 1 / logNotProbability;
  276. }
  277. chooseSkipCount();
  278. }
  279. private:
  280. /* The likelihood that any given call to |trial| should return true. */
  281. double mProbability;
  282. /*
  283. * The value of 1/std::log(1 - mProbability), cached for repeated use.
  284. *
  285. * If mProbability is exactly 0 or exactly 1, we don't use this value.
  286. * Otherwise, we guarantee this value is in the range [-2**53, -1/37), i.e.
  287. * definitely negative, as required by chooseSkipCount. See setProbability for
  288. * the details.
  289. */
  290. double mInvLogNotProbability;
  291. /* Our random number generator. */
  292. non_crypto::XorShift128PlusRNG mGenerator;
  293. /* The number of times |trial| should return false before next returning true. */
  294. size_t mSkipCount;
  295. /*
  296. * Choose the next skip count. This also returns the value that |trial| should
  297. * return, since we have to check for the extreme values for mProbability
  298. * anyway, and |trial| should never return true at all when mProbability is 0.
  299. */
  300. bool chooseSkipCount() {
  301. /*
  302. * If the probability is 1.0, every call to |trial| returns true. Make sure
  303. * mSkipCount is 0.
  304. */
  305. if (mProbability == 1.0) {
  306. mSkipCount = 0;
  307. return true;
  308. }
  309. /*
  310. * If the probabilility is zero, |trial| never returns true. Don't bother us
  311. * for a while.
  312. */
  313. if (mProbability == 0.0) {
  314. mSkipCount = SIZE_MAX;
  315. return false;
  316. }
  317. /*
  318. * What sorts of values can this call to std::floor produce?
  319. *
  320. * Since mGenerator.nextDouble returns a value in [0, 1-2**-53], std::log
  321. * returns a value in the range [-infinity, -2**-53], all negative. Since
  322. * mInvLogNotProbability is negative (see its comments), the product is
  323. * positive and possibly infinite. std::floor returns +infinity unchanged.
  324. * So the result will always be positive.
  325. *
  326. * Converting a double to an integer that is out of range for that integer
  327. * is undefined behavior, so we must clamp our result to SIZE_MAX, to ensure
  328. * we get an acceptable value for mSkipCount.
  329. *
  330. * The clamp is written carefully. Note that if we had said:
  331. *
  332. * if (skipCount > SIZE_MAX)
  333. * skipCount = SIZE_MAX;
  334. *
  335. * that leads to undefined behavior 64-bit machines: SIZE_MAX coerced to
  336. * double is 2^64, not 2^64-1, so this doesn't actually set skipCount to a
  337. * value that can be safely assigned to mSkipCount.
  338. *
  339. * Jakub Oleson cleverly suggested flipping the sense of the comparison: if
  340. * we require that skipCount < SIZE_MAX, then because of the gaps (2048)
  341. * between doubles at that magnitude, the highest double less than 2^64 is
  342. * 2^64 - 2048, which is fine to store in a size_t.
  343. *
  344. * (On 32-bit machines, all size_t values can be represented exactly in
  345. * double, so all is well.)
  346. */
  347. double skipCount = std::floor(std::log(mGenerator.nextDouble())
  348. * mInvLogNotProbability);
  349. if (skipCount < SIZE_MAX)
  350. mSkipCount = skipCount;
  351. else
  352. mSkipCount = SIZE_MAX;
  353. return true;
  354. }
  355. };
  356. } /* namespace mozilla */
  357. #endif /* mozilla_FastBernoulliTrial_h */