Failure to decode Python 2.7 Code Object #1

landaire · 2020-12-11T00:07:54Z

First off, thanks for doing the hard work for this crate. I have a project where I'd like to just read data from a marshalled Python object and this solution fits the bill.

The code object I'm trying to deserialize is from Python 2.7 which appears to have a slightly different layout for TYPE_CODE. If you try to deserialize this data currently, the code object will advance the stream beyond where it should be and you'll start deserializing incorrect data.

I wrote a patch to add basic support for Python 2.7:

diff --git a/Cargo.toml b/Cargo.toml
index aed85a2..4faa7ce 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -18,3 +18,7 @@ num-bigint = "0.2"
 num-complex = "0.2"
 error-chain = "0.12"
 owning_ref = { version = "0.4", optional = true }
+
+[features]
+default = []
+python27 = []
\ No newline at end of file
diff --git a/src/lib.rs b/src/lib.rs
index 73e9402..a4e6e27 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -115,12 +115,12 @@ pub struct Code {
     pub flags:           CodeFlags,
     pub code:            Arc<Vec<u8>>,
     pub consts:          Arc<Vec<Obj>>,
-    pub names:           Vec<Arc<String>>,
-    pub varnames:        Vec<Arc<String>>,
-    pub freevars:        Vec<Arc<String>>,
-    pub cellvars:        Vec<Arc<String>>,
-    pub filename:        Arc<String>,
-    pub name:            Arc<String>,
+    pub names:           Vec<Arc<Vec<u8>>>,
+    pub varnames:        Vec<Arc<Vec<u8>>>,
+    pub freevars:        Vec<Arc<Vec<u8>>>,
+    pub cellvars:        Vec<Arc<Vec<u8>>>,
+    pub filename:        Arc<Vec<u8>>,
+    pub name:            Arc<Vec<u8>>,
     pub firstlineno:     u32,
     pub lnotab:          Arc<Vec<u8>>,
 }
@@ -920,12 +920,12 @@ pub mod read {
                 re: r_float_bin(p)?,
                 im: r_float_bin(p)?,
             })),
-            Type::String => Some(Obj::Bytes(Arc::new(r_bytes(r_long(p)? as usize, p)?))),
+            Type::String => { let obj = Some(Obj::Bytes(Arc::new(r_bytes(r_long(p)? as usize, p)?))); println!("{:?}", obj); obj },
             Type::AsciiInterned | Type::Ascii | Type::Interned | Type::Unicode => {
-                Some(Obj::String(Arc::new(r_string(r_long(p)? as usize, p)?)))
+                Some(Obj::Bytes(Arc::new(r_bytes(r_long(p)? as usize, p)?)))
             }
             Type::ShortAsciiInterned | Type::ShortAscii => {
-                Some(Obj::String(Arc::new(r_string(r_byte(p)? as usize, p)?)))
+                Some(Obj::Bytes(Arc::new(r_bytes(r_byte(p)? as usize, p)?)))
             }
             Type::SmallTuple => Some(Obj::Tuple(Arc::new(r_vec(r_byte(p)? as usize, p)?))),
             Type::Tuple => Some(Obj::Tuple(Arc::new(r_vec(r_long(p)? as usize, p)?))),
@@ -948,8 +948,8 @@ pub mod read {
             Type::Dict => Some(Obj::Dict(Arc::new(RwLock::new(r_hashmap(p)?)))),
             Type::Code => Some(Obj::Code(Arc::new(Code {
                 argcount: r_long(p)?,
-                posonlyargcount: if p.has_posonlyargcount { r_long(p)? } else { 0 },
-                kwonlyargcount: r_long(p)?,
+                posonlyargcount: if cfg!(not(feature = "python27")) && p.has_posonlyargcount { r_long(p)? } else { 0 },
+                kwonlyargcount: if cfg!(not(feature = "python27"))  { r_long(p)? } else { 0 },
                 nlocals: r_long(p)?,
                 stacksize: r_long(p)?,
                 flags: CodeFlags::from_bits_truncate(r_long(p)?),
@@ -996,9 +996,9 @@ pub mod read {
     fn r_object_not_null(p: &mut RFile<impl Read>) -> Result<Obj> {
         Ok(r_object(p)?.ok_or(ErrorKind::IsNull)?)
     }
-    fn r_object_extract_string(p: &mut RFile<impl Read>) -> Result<Arc<String>> {
+    fn r_object_extract_string(p: &mut RFile<impl Read>) -> Result<Arc<Vec<u8>>> {
         Ok(r_object_not_null(p)?
-            .extract_string()
+            .extract_bytes()
             .map_err(ErrorKind::TypeError)?)
     }
     fn r_object_extract_bytes(p: &mut RFile<impl Read>) -> Result<Arc<Vec<u8>>> {
@@ -1011,15 +1011,15 @@ pub mod read {
             .extract_tuple()
             .map_err(ErrorKind::TypeError)?)
     }
-    fn r_object_extract_tuple_string(p: &mut RFile<impl Read>) -> Result<Vec<Arc<String>>> {
+    fn r_object_extract_tuple_string(p: &mut RFile<impl Read>) -> Result<Vec<Arc<Vec<u8>>>> {
         Ok(r_object_extract_tuple(p)?
             .iter()
             .map(|x| {
                 x.clone()
-                    .extract_string()
+                    .extract_bytes()
                     .map_err(|o: Obj| Error::from(ErrorKind::TypeError(o)))
             })
-            .collect::<Result<Vec<Arc<String>>>>()?)
+            .collect::<Result<Vec<Arc<Vec<u8>>>>>()?)
     }
 
     fn read_object(p: &mut RFile<impl Read>) -> Result<Obj> {

There's a lot going on here, but the basic change is this:

-                posonlyargcount: if p.has_posonlyargcount { r_long(p)? } else { 0 },
-                kwonlyargcount: r_long(p)?,
+                posonlyargcount: if cfg!(not(feature = "python27")) && p.has_posonlyargcount { r_long(p)? } else { 0 },
+                kwonlyargcount: if cfg!(not(feature = "python27"))  { r_long(p)? } else { 0 },

This on its own is not enough, however. At least for the object I'm trying to deserialize, things break again once the filename field is deserialized. r_object is invoked again to read the filename string, and then you hit this block:

Type::String => Some(Obj::Bytes(Arc::new(r_bytes(r_long(p)? as usize, p)?))),

Note the change to r_object_extract_string:

-    fn r_object_extract_string(p: &mut RFile<impl Read>) -> Result<Arc<String>> {
+    fn r_object_extract_string(p: &mut RFile<impl Read>) -> Result<Arc<Vec<u8>>> {
         Ok(r_object_not_null(p)?
-            .extract_string()
+            .extract_bytes()
             .map_err(ErrorKind::TypeError)?)
     }

Type::String currently returns an Obj::Bytes but extract_string expects an Obj::String. I'm not sure what the right approach is here. I ended up just changing all of the strings to be Vec<u8>s since, at least in my case, the strings may contain wildly invalid UTF-8.

I wanted to ask: are you interested in supporting Python 2.7 deserialization? I wouldn't mind cleaning up this patch a bit and adding tests. I would need to study the Python 2.7 marshal code a bit more though to understand what the right approach here is for handling these strings correctly.

The text was updated successfully, but these errors were encountered:

sollyucko · 2021-02-01T03:57:34Z

I am so sorry for not noticing this until now, but if you're still interested...

Rather than adding a feature flag, I feel like it would make sense to just use the has_posonlyargcount in-code configuration option, and add another has_kwonlyargcount option, which are designed to accommodate for differences between Python versions. (The way they work is that if you use marshal_load_ex, you can specify a MarshalLoadExOptions struct with the configuration options describing the behavior of the relevant Python version.)

Re: strings vs bytes: any idea what encoding the strings you are encountering are in? Latin-1? Or something else? If the encoding is predictable, would converting to UTF-8 AKA String be okay? If so, the code for creating the Obj::Code (https://github.com/sollyucko/py-marshal/blob/master/src/lib.rs#L949-L966) could probably be changed to automatically convert a Obj::Bytes from the correct encoding into a String.

FYI: I'm happy to support as wide of a range of versions and use-cases as possible. However, I'm not really working on this much anymore myself, so feel free to submit a PR.

landaire · 2021-02-01T05:15:46Z

I am so sorry for not noticing this until now, but if you're still interested...

No worries, I know how it goes :)

For full context, I started using this crate as I didn't want to write my project in Python. I also was not aware of how different the Python 2.7 and Python 3 marshal format was. As I've worked on the project since creating this issue some other changes have been made. You can view them here: master...landaire:master

Note the change looks a lot larger than it really is since somewhere along the way I had energy to add support for serializing data, refactored some things, then quickly lost that energy. Here's a rough list of major changes:

String refs appear to be handled differently in Python 2
The types that can be deserialized may have a different marker (787c324)
This isn't committed and pushed but I ended up changing strings to be based on https://docs.rs/bstr/0.2.14/bstr/ as a cheap workaround

Rather than adding a feature flag, I feel like it would make sense to just use the has_posonlyargcount in-code configuration option, and add another has_kwonlyargcount option, which are designed to accommodate for differences between Python versions. (The way they work is that if you use marshal_load_ex, you can specify a MarshalLoadExOptions struct with the configuration options describing the behavior of the relevant Python version.)

This makes sense.

Re: strings vs bytes: any idea what encoding the strings you are encountering are in?

They should be UTF-8, but I think the problem is that in Python 2.7 strings aren't guaranteed to be UTF-8 -- I think they're just arbitrary bytes but I could be wrong. As I eluded to above, I ended up just using the bstr crate for Strings/Bytes -- and this has worked out pretty well. I'll push this change soon.

FYI: I'm happy to support as wide of a range of versions and use-cases as possible. However, I'm not really working on this much anymore myself, so feel free to submit a PR.

Totally understand. My work on the project has been sort of off-and-on, so when I get a chance I'll package things up and submit a PR. Hopefully you don't feel any pressure -- I've been moving along just fine with a git source specified for the crate.

sollyucko · 2021-02-01T15:55:22Z

Things I'm fine with:

Changing error-chain to thiserror
Renaming Result to ParseResult
Changing String to BString: I guess this is fine, due to Python 2's use of non-strict UTF-8
Moving mod read to its own file: Great idea, I was even thinking about this myself
Adding Python 2 type characters

Things I'm not fine with:

Removing Python 3 type characters: Is there any reason this is necessary?
Adding the fictitious Bytes type: Shouldn't the mapping be Obj::Bytes => Type::String and Obj::String => Type::Unicode?
Removing posonlyargcount and kwonlyargcount: Is there any reason these can't just be 0 in Python 2?
Modifying existing test data to remove the posonlyargcount and kwonlyargcount fields: I'd rather have multiple tests using different MarshalLoadExOptions values

AFAICT from my testing:

Python 2 displays str as UTF-8 and marshals it as s + possibly-invalid UTF-8
Python 2 displays unicode as UCS-1/2/4 (UCS-4 = UTF-32) and marshals it as u + UTF-8
Python 3 marshals bytes as s
Python 3 displays str as characters rather than escapes and marshals it as u/z (ignoring interning and FLAG_REF) + UTF-8
Basically, str/bytes is marshalled as s, unicode/str is typically marshalled as u, and everything is marshalled using UTF-8. However, Python 2 isn't strict about having proper UTF-8.

To summarize: I'm happy to support Python 2, but without removing support for Python 3. I'm fine with making breaking changes if needed, though.

sollyucko mentioned this issue Apr 10, 2023

add Bytes to ObjHashable #9

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failure to decode Python 2.7 Code Object #1

Failure to decode Python 2.7 Code Object #1

landaire commented Dec 11, 2020

sollyucko commented Feb 1, 2021

landaire commented Feb 1, 2021

sollyucko commented Feb 1, 2021

Failure to decode Python 2.7 Code Object #1

Failure to decode Python 2.7 Code Object #1

Comments

landaire commented Dec 11, 2020

sollyucko commented Feb 1, 2021

landaire commented Feb 1, 2021

sollyucko commented Feb 1, 2021