traced_probes: fix CpuReaders not waking up on non-first tracing session

Symptoms of the issue:
Massive overruns would be observed in the ftrace buffers even though the
buffers are more than big enough for the ftrace flush period for the
volume of data being written. This would only happen on the Android
build but not on the standalone build started with tmux (though this was
a red herring - see below for why).

Upon further investigation, with a program which would context switch
every 1ms (while confined to the BIG cores of the phone), we found that
traced_probes was not being scheduled for up to 12s after the initial
read of the buffers. However, there was plenty of CPU available because
the little cores were essentially free so this was not because of
starvation but rather some logical bug somewhere causing us not to wake
up correctly.

Cause:
This is a subtle bug caused by not resetting the cpus_to_drain field
when we start a tracing session. This means we leak state between an old
tracing session (which likely has the cpus_to_drain field set to some
non-zero value) and the current tracing session.

This is a problem because this variable is checked by the CpuReaders to
decide whether to wake up the main thread to drain the buffer. As every
CpuReader was seeing this non-zero value, everyone thought the flush had
already been scheduled which means no one scheduled the flush.

This leads to the drains being ignored for all of trace; with traced_probes
only waking up if there was a spurious wakeup or similar.

Resolution:
Similar to how flush_acks are reset just before a kFlush is issued in
FtraceController, clear the cpus_to_drain just before a kRun is issued.

This matches the code with the documented behaviour in
ftrace_thread_sync.h which says that we are meant to do this.

Unittest failure:
There were actually 4 bugs in the FtraceControllerTest::BackToBackEnableDisable
1. We were expecting only one Post{Task, DelayedTask} each which is preceisely
the behaviour we are trying to fix by the above change - we should get 2
PostTasks as we have two data sources and the state of the first should not
affect the state of the second.
2. We were not running the DelayedTask even while we were expecting it to be
present.
3. We were not resetting the data available callback between data sources
which is wrong because the callback captures state about the generation which
is out of date for the second data source.
4. We were relying on unspecified behaviour of std::function clearing the task
when we moved from it which is not guaranteed to happen.

Fix all of these bugs which essentially makes this test a correctness check for
the real change.

Interesting notes:
This took 1 day of 3.5 engineers time for a 1 line fix. As usual, the
simple, subtle failures are the hardest to find.

Bug: 130202267
Change-Id: I8f5045adda84b633b80186a1e7080886366cd923
2 files changed
tree: f89880569f19f5916fce93aed9a47e4689f3998a
  1. build_overrides/
  2. buildtools/
  3. debian/
  4. docs/
  5. gn/
  6. include/
  7. infra/
  8. protos/
  9. src/
  10. test/
  11. tools/
  12. ui/
  13. .clang-format
  14. .gitignore
  15. .gn
  16. .travis.yml
  17. Android.bp
  18. Android.bp.extras
  19. BUILD
  20. BUILD.extras
  21. BUILD.gn
  22. codereview.settings
  23. heapprofd.rc
  24. MODULE_LICENSE_APACHE2
  25. NOTICE
  26. OWNERS
  27. perfetto.rc
  28. PRESUBMIT.py
  29. README.chromium
  30. README.md
  31. TEST_MAPPING
README.md

Perfetto - Performance instrumentation and tracing

Perfetto is an open-source project for performance instrumentation and tracing of Linux/Android/Chrome platforms and user-space apps.

See www.perfetto.dev for docs.

Bugs

  • For bugs affecting Android or the tracing internals use the internal bug tracker (go/perfetto-bugs).
  • For bugs affecting Chrome use http://crbug.com, Component:Speed>Tracing label:Perfetto.