-
-
Notifications
You must be signed in to change notification settings - Fork 344
Make author and committer date roundtrip #1935
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d190af2
to
47db249
Compare
In terms of performance, the micro benchmark result looks good:
For the macro benchmark, the wall time of the entire things is similar/slightly improved, but
So if I read this correctly, traversing the graph went from 10.83s down to 10.55s but estimate hours went from 28ms to 190ms. I would read this as an acceptable trade-off. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for making this happen, and apologies for the late response.
I did play with ein t hours
and profiled it. There it showed that the reason for the slowdown seems to be the unicode-space splitting. Doing this…
diff --git a/gix-actor/src/lib.rs b/gix-actor/src/lib.rs
index 67ce39504..23b9a3f2f 100644
--- a/gix-actor/src/lib.rs
+++ b/gix-actor/src/lib.rs
@@ -98,8 +98,9 @@ pub struct SignatureRef<'a> {
impl SignatureRef<'_> {
/// Seconds since unix epoch from the time
pub fn seconds(&self) -> gix_date::SecondsSinceUnixEpoch {
+ use winnow::stream::AsChar;
self.time
- .split(u8::is_ascii_whitespace)
+ .split(|b| b.is_space())
.next()
.and_then(|i| i.to_str().ok())
.and_then(|s| s.parse().ok())
Seemed to not make a significant difference, even though I also didn't see it in the profile anymore.
This probably doesn't matter though as I also realized that turning time: gix_date::Time
into a BString
feels wrong, besides being a worst-case change for the downstream which at this point we have to consider.
I also doubt that this will be it, i.e. there is already some special casing of user names and emails, and flexible parsing of whitespace, which also wouldn't be restored correctly.
Hence I'd think that if it was parsed, there should be some sort of raw: Option<BString>
field that contains the entire line. Only then round-tripping will work for sure, at least with the signature field.
Doing so will open up gates to hell when there is mutable fields and a raw
field that go out of sync, when present.
This makes me question the whole approach, and I'd prefer to say that once the user has a non *Ref
version of something, it won't round-trip. Or said differently, once the user has potentially changed an object, it will be written 'cleanly' without extra spaces and whatever else was leniently parsed.
This makes me want to understand the use-case more. When does one decode and re-encode an object, and expect it to be the same byte for byte? During FSCK? Fair, but then it should actually be interesting that it's non-conforming and one could show a diff with the lines that don't match anymore.
Is it cherry-picking? Then I'd understand why we are specifically interested in the signatures, even though I wonder if there is much value in not writing it out verbatim, and not a 'cleaned' version, with non-significant bytes removed.
While I understand in theory, in practice, I think this may be enough for the signature field.
Just to be sure I understand your suggestion precisely, are you advocating for having I see two potential issues with this:
In Mononoke, we implemented a "Git server" that's not backed by the file system, but by a "blobstore" (conceptually, db storage of bytes). It speaks the Git protocol. There are two places we get Git bytes:
It's a choice that may not be absolutely set-in-stone, but we tend to use the owned version of Anyway, to get back to the issue at hand, I think we should maintain round-tripping between FYI, I'm currently tackling the long tail of repos that have caused issues (failure for us to import them in Mononoke or to clone them from Mononoke) out of >7000 repos, only ~30 couldn't be imported yet, and with the other PRs I made and this one, there is a single issue left that I'm aware of and hasn't been adressed yet (a strange way to encode the PGP signature. PR soon). So at least experimentally, we're not too far from handling every edge case. Once all this code is upstream, I'll make sure we can clone each repo again, to catch any regression or omission. |
47db249
to
4701f25
Compare
This push was just rebasing and applying the suggestion above:
|
Sorry for the late reply, I feel off a cliff there, unrelated to this issue which, admittedly, is among the big 'small' problems to solve. However, I think it will also help to define a boundary for what round-tripping means, and to what extend that's possible. Just as a model I have in my head, we have
If we break downstream to support more cases of round-tripping, I'd want to do it so that it's done for good. Looking at However,
It's true, even though I think it's more subtle as the difference is already in the types themselves.
They live in the original byte buffer.
I see, it's really valuable to see where the requirement is coming from, and I get why it currently is done as it is.
Thanks, that's cool :)! And it's even open-source last time I checked.
Here I don't understand is why objects are decoded, without mutating them, to then be re-serialized in order to obtain their hash.
I agree, and right now it seems that
I agree.
I am excited about this! For this PR, I think the problem is that it feels too wrong to have I did my best but apologize if some of what I wrote is off or incoherent, I had like 10 interruptions 😅👧.
This makes me hopeful. Maybe you can spike a case of mononoke parallelism which doesn't pass |
No worries at all. I really appreciate the time you are taking to very thoroughly weigh our options and to explain the wider vision so I can fumble in the dark a little less :) Let me suggest 3 possible way forwards, and I will follow your lead on which one to implement: Option 1:
|
Thanks for the analysis and for deriving options! Some of these I implicitly discarded, but it's great to see them written out. Bringing up the Thus, let's go ahead with Option 3. I'd still expect that Lastly, if you wanted to truly gift Thanks again, it's my pleasure to work with you. |
7709926
to
f686874
Compare
I went ahead and implemented it in an extra commit on top for ease of review. I can go ahead and improve the history later.
I can follow-up with this in a separate PR once we've got this one finalized. Would you like me to open an issue so we don't forget? |
It seems the fuzzer like me is not a fan of:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sketching this! And yes, let's keep committing on top.
I can follow-up with this in a separate PR once we've got this one finalized. Would you like me to open an issue so we don't forget?
I added some tasks for myself in the PR text and will just take this on myself before merging.
I think everything looks good and ein t hours
as well as gix-mailmap
are good examples for how the downstream could adjust to these changes without losing anything.
Lastly, the only change needed here is to make to_owned()
fallible - that should also fix the fuzzer. Once done, I will take it from there.
If you won't get to it within a day or so please let me know and I can finish the PR from here.
Looking forward to merging! This PR affected gitoxide
a lot as it now finally became clear how to handle real-world garbage (*Ref
) while also assuring users will produce only proper objects when creating them (*[NonRef]
).
gix-actor/src/signature/mod.rs
Outdated
@@ -18,7 +19,7 @@ mod _ref { | |||
Signature { | |||
name: self.name.to_owned(), | |||
email: self.email.to_owned(), | |||
time: self.time.to_owned(), | |||
time: Time::from_bytes(self.time).expect("Time must be valid"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to_owned()
would definitely need to be fallible now, time
can be anything.
gitoxide-core/src/hours/mod.rs
Outdated
@@ -82,7 +82,8 @@ where | |||
}; | |||
let name = string_ref(author.name.as_ref()); | |||
let email = string_ref(author.email.as_ref()); | |||
let time = string_ref(author.time.as_ref()); | |||
let mut buf = Vec::with_capacity(64); | |||
let time = string_ref(author.time.to_ref(&mut buf)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed a fix for this, leading to these numbers:
❯ ein t hours
08:25:34 traverse commit graph done 1.3M commits in 9.12s (143.6K commits/s)
08:25:34 estimate-hours Extracted and organized data from 1309152 commits in 7.67ms (170684736 commits/s)
total hours: 1243723.88
total 8h days: 155465.48
total commits = 1309152
total authors: 34641
total unique authors: 26510 (23.47% duplication)
linux ( master) +408 -798 [!⇣14583] took 9s
❯ /Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/release/ein t hours
08:26:42 traverse commit graph done 1.3M commits in 7.84s (167.1K commits/s)
08:26:43 estimate-hours Extracted and organized data from 1309152 commits in 39.432084ms (33200174 commits/s)
total hours: 1243723.88
total 8h days: 155465.48
total commits = 1309152
total authors: 34641
total unique authors: 26510 (23.47% duplication)
linux ( master) +408 -798 [!⇣14583] took 8s
❯ /Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/release/ein t hours
08:32:59 traverse commit graph done 1.3M commits in 7.73s (169.4K commits/s)
08:32:59 estimate-hours Extracted and organized data from 1309152 commits in 5.423541ms (241383248 commits/s)
total hours: 1243723.88
total 8h days: 155465.48
total commits = 1309152
total authors: 34641
total unique authors: 26510 (23.47% duplication)
gix-actor/src/signature/mod.rs
Outdated
|
||
impl Signature { | ||
/// Borrow this instance as immutable | ||
pub fn to_ref(&self) -> SignatureRef<'_> { | ||
pub fn to_ref<'a>(&'a self, buf: &'a mut Vec<u8>) -> SignatureRef<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely annoying, but I wouldn't know what else to do either.
Since we'd probably have to live with that, buf
should certainly be documented as it may not be obvious to the caller why it's needed.
let signature: SignatureRef = gix_actor::SignatureRef::from_bytes::<()>(input).unwrap(); | ||
let mut output = Vec::new(); | ||
signature.write_to(&mut output)?; | ||
assert_eq!(output.as_bstr(), input.as_bstr()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be interesting to have another assertion that shows that Signature
does not round-trip nor is it able to represent all of SignatureRef
.
gix-actor/src/signature/mod.rs
Outdated
@@ -61,16 +63,10 @@ mod convert { | |||
Signature { | |||
name: name.to_owned(), | |||
email: email.to_owned(), | |||
time: time.to_owned(), | |||
time: Time::from_bytes(time).expect("Time must be valid"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This From
implementation would also have to be removed now, but should probably be replaced with TryFrom
.
gix-date/src/lib.rs
Outdated
@@ -44,6 +44,11 @@ impl Time { | |||
let s = i.as_bstr().to_str().expect("Input must be ascii"); | |||
parse_raw(s).ok_or(Error::InvalidDateString { input: s.into() }) | |||
} | |||
/// Write time into buffer | |||
pub fn to_ref<'a>(&self, buf: &'a mut Vec<u8>) -> &'a BStr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
buf
should probably be explained in the docs.
As always, thanks a lot for the thorough review and thoughtful feedback. I'm glad we're progressing in discovering a design that makes sense for gitoxide through fixing this particular issue.
Yes, that makes sense to me. Will make the fuzzer happy and will make me happy too. Then, the contract will be clear for callers. Without calling them out individually, the other review comments make sense to me too.
Unfortunately, for the rest of the week, I'm not going to be able to dedicate much time to it as I am working synchronously on something else with my team. I can get back on it next Monday if you don't get to it first. |
Just FYI, I make good progress and plan to merge this PR today. |
`gix_date::Time` can be parsed in two ways: * From raw bytes (in the data format Git adheres too) * From the date format used in gix config Add utility constructors for these two cases. When parsing from raw bytes, also support a format like `<sign><HH><MM><SS>` since this is a situation that happens in the wild and that Git can handle.
By storing the raw bytes from Git instead of parsing them into a `gix_date::Time` from the start, we are able to roundtrip between Git and gix-actor even with creatively formatted Git data. Also add a test case that shows a time offset seen in the wild which would have failed to round-trip without this change.
Use the committer date and author date that are now backed by bytes and interpret these bytes into a `gix_date::Time` on demand.
We now store a gix-date::Time in the owned `Signature` and a `&BStr` in `SignatureRef`. This means that `SignatureRef` round-trips to and from Git, but `Signature` doesn't necessarily round-trip with `SignatureRef`. It also means that we need a buffer of bytes when creating a `SignatureRef`.
* use `gix_date::Time` instead of buffer-based time. ``` ❯ ein t hours 08:25:34 traverse commit graph done 1.3M commits in 9.12s (143.6K commits/s) 08:25:34 estimate-hours Extracted and organized data from 1309152 commits in 7.67ms (170684736 commits/s) total hours: 1243723.88 total 8h days: 155465.48 total commits = 1309152 total authors: 34641 total unique authors: 26510 (23.47% duplication) linux ( master) +408 -798 [!⇣14583] took 9s ❯ /Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/release/ein t hours 08:26:42 traverse commit graph done 1.3M commits in 7.84s (167.1K commits/s) 08:26:43 estimate-hours Extracted and organized data from 1309152 commits in 39.432084ms (33200174 commits/s) total hours: 1243723.88 total 8h days: 155465.48 total commits = 1309152 total authors: 34641 total unique authors: 26510 (23.47% duplication) linux ( master) +408 -798 [!⇣14583] took 8s ❯ /Users/byron/dev/github.com/GitoxideLabs/gitoxide/target/release/ein t hours 08:32:59 traverse commit graph done 1.3M commits in 7.73s (169.4K commits/s) 08:32:59 estimate-hours Extracted and organized data from 1309152 commits in 5.423541ms (241383248 commits/s) total hours: 1243723.88 total 8h days: 155465.48 total commits = 1309152 total authors: 34641 total unique authors: 26510 (23.47% duplication) ```
- find a way to reasonably keep previous semantics in downstream users
The contained `time` field is now a string, which has to be parsed into a time for conversion to an owned type. Additionally, replace `From<SignatureRef> for Signature` with `TryFrom`.
We also add a `gix_date::Time::to_str()` method, along with related utilities, to be able to turn a parsed time back into a raw buffer conveniently. Further, remove `Time::to_bstring()` in favor of a `Display` implementation.
…trip. Round-tripping is now handled in the `gix-actor` crate using the `SignatureRef` type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I think this will be it!
I recommend taking a look at the changes to gix-actor
and gix-date
just to catch up with my changes. gix-date
should be more conformant with Git now, and overall I am very happy with how it turned out.
Awesome. I had a look and it makes sense based on our discussion. Thanks a lot for carrying this PR over the line :) |
I also hope it works for you in Mononoke! |
Thank you, I agree. I suspect the final state will be quite nice: maintainable and better optimized than the current state. It may take us some time to get there, but I'll happily report back once it's done. |
Store the bytes in raw Git format in the Signature, allowing round-tripping even for creatively formatted git data.
Whenever we need to interpret the time, parse it on demand.
Note that the most common need for interpreting the time is limited to its
seconds
part.We expose a way to parse the
seconds
without parsing the rest to avoid unnecessary costs.Fixes issue 1923
Tasks Byron
gix-date
maybesign
field, while we are breaking things